Abstract


Following an explanation and discussion of the importance of voice communications for military operations, including the
environmental and propagation effects and ECM, the Lectures will outline:

- speech coding, which is mainly concerned with man-to-man voice communication

- speech synthesis, which deals with machine-to-man communication

- speech recognition, which is related to man-to-machine communication.

All these are techniques which involve speech compression or speech coding at low bit rates and are needed for transmitting
speech messages with a high level of security and reliability over low data-rate channels and for other applications such as
memory-efficient systems for voice storage and response.

The themes above will be underpinned by a lecture on the nature of the speech signal (production, recognition and perception)
and complemented by other lectures on quality assessment of speech systems and on standards, which are crucial for the
satisfactory deployment of speech systems.

This Lecture Series, sponsored by the Avionics Panel of AGARD, has been implemented by the Consultant and Exchange
Programme.


Résumé


Following a presentation and discussion of the importance of voice links in military operations, including the effects of
propagation and of the transmission medium, the lectures will deal with the following subjects:

- speech coding, or man-to-man voice links

- speech synthesis, or machine-to-man dialogue

- speech recognition, or man-to-machine dialogue.

All these techniques, which rely on speech compression or on the coding of voice signals at low bit rates,
make it possible to transmit voice messages over low-rate channels while ensuring high levels of security and
reliability. They also lend themselves to other applications such as memory-efficient systems for speech storage
and voice response.

The lectures on the themes listed above will be preceded by a presentation on the nature of the speech
signal (generation, recognition and perception), complemented by other lectures on the quality assessment
of voice communication systems and on the standards that are indispensable for deploying these systems under
satisfactory conditions.

This Lecture Series is presented within the framework of the Consultant and Exchange Programme, under the aegis of the
Avionics Panel of AGARD.

OVERVIEW OF REQUIREMENTS AND NETWORKS
FOR VOICE COMMUNICATIONS AND SPEECH PROCESSING


A. NEJAT INCE

Istanbul Technical University
Ayazaga Campus
Istanbul

This paper starts with a discussion on the use of voice for military and
civil communications and continues to outline the military operational
requirements in relation to air operations, including the effects of
propagational factors and electronic warfare. Structures of the existing NATO
communications network and the evolving Integrated Services Digital Network
(ISDN) are reviewed to show how they meet the requirements.

It is concluded that speech coding at low bit rates is a growing need for
transmitting speech messages with a high level of security and reliability
over low data-rate channels and for memory-efficient systems for voice
storage, voice response, voice mail, etc. Furthermore, it is pointed out
that low-bit-rate voice coding can ease the transition to shared channels
for voice and data and can readily adapt voice messages for packet switching.

The speech processing techniques and systems are then outlined as an
introduction to the lectures of this series in terms of:

- the character of the speech signal, its generation and perception

- speech coding, which is mainly concerned with man-to-man voice
communication

- speech synthesis, which deals with machine-to-man communication

- speech recognition, which is related to man-to-machine communication
and

- quality assessment of speech systems and standards.

1. INTRODUCTION


Although there are many shades of opinion, communication is broadly
defined to be the establishment of a social unit from individuals, by the use
of language or signs (1). When we communicate, one with another, we make
sounds with our vocal organs, or scribe different shapes of ink mark on paper
(or some other medium), or gesticulate in various patterned ways; such
physical signs or signals have the ability to change thoughts and
behaviour: they are the medium of communication. Telecommunications engineers
have as their business the extension of the distance over which the
communication process normally takes place, by transmitting such signals while
preserving their forms in such systems as telephones, telegraphs, facsimile
and video.


It must be noted here that the "social unit" that is NATO in our case is
multilingual and multinational, with all that these imply in exchanging or
sharing information, which makes it different from a more homogeneous national
environment. Greater care must therefore be exercised in using national
results relating to speech input/output systems. One feels instinctively that
communications in NATO would be somewhat more difficult, complex, less
accurate and longer, thus making written communications more important.

There are two distinct classes of signal. There are signals in time, such
as speech or music; and there are signals in space, like print, stone
inscriptions, punched cards, and pictures. Out of all these communication
forms, "speech" is, perhaps, the most "natural" mode by which human beings
communicate with each other. There are also good reasons for people wishing
to use speech to communicate with machines. It must, however, be pointed out
that there is not much empirical evidence to show the value of speech over
other modes of communication.

In a recent study carried out by the author (2) it was established that
in a tri-service strategic C3 environment about half of the total traffic in
Erlangs was for voice and the rest was approximately equally divided between
data and message traffic. In an information-theoretic sense, however, the
bulk of communication was carried by the message handling system. About 70%
of the traffic was for air operations. It is, however, expected that these
proportions will change with time in favour of the data traffic. The traffic
situation is, of course, very different in the civil network where, at least
in the foreseeable future, voice service will continue to predominate over all
others. It must be stated, however, that the main reason for the preponderance
of message traffic in military networks today is the requirement of
"recording" information in a secure and easily accessible way and the ability
to coordinate and disseminate it.

Notwithstanding the above, an experiment carried out at Johns Hopkins
University (3) showed that teams of people interacting together to solve
problems solved them much faster using voice than any other mode of
communication. There are other studies which indicate that voice provides
advantages over other means of communication for certain applications. There
is no doubt that the main reason for the preference for voice, at least for
certain applications, stems from its being "natural", not requiring any
special training to learn, and freeing the hands and eyes for other tasks.

The features of speech communications that are disadvantageous relate to
the difficulty of keeping permanent secure records, interference caused by
competing environmental acoustic noise, physical/psychological changes in the
speaker causing changes in speech characteristics, disabilities of
speaking/hearing, and finally its serial and informal nature leading to slower
information transfer and/or information access. It must be pointed out,
however, that some of the disadvantages of speech communication are dependent
on the state of technology and can therefore change with time and
application.

Fig. 1 shows how the importance of the communication mode changes with the
phases of an engineering project (4). The importance of text dominates at the
beginning and end of an engineering development process. In the middle of the
process, other communication modes rise and fall in importance, due
to the specialised design and implementation methods of engineering. Graphics
maintains its importance throughout the process.

From the example above it is not too difficult to see a certain degree of
resemblance between the modes of communication required for an engineering
development project and those for command and control: all modes are required
in general, with preference given to some depending on application and the
development of technologies and operational concepts.

h  WHTOMliSM  tUmtglttl 

Since the subject of our Lecture Series is particularly related to Air
Operations in NATO, we should now take a brief look at the type of
communications that they require and the type of environment in which they
are to work.

Air operations involve both fixed and mobile platforms (land, sea and
air), and the communications that are required to interconnect them consist of:

- A switched terrestrial network

- Air/ground communications and

- Intra-aircraft (cockpit) communications.

These communications are used to support:

- the management of offensive air operations

- the management of defensive aircraft

- regional, sub-regional air defence control systems.

In addition there are also dedicated communications employed for
sustained surveillance, navigation and IFF.

The main air warfare missions and associated ranges together with the
types of communications required are given in Fig. 3. These communications are
currently provided by a combination of NATO and national networks using both
terrestrial and satellite links together with VHF/UHF ground/air, air/air
and HF radio communications to and between tactical/strategic aircraft
(Fig. 3).

The terrestrial transmission systems used today provide nominally 4 kHz
analogue circuits, even though the NATO SATCOM system is totally digitised and
some national systems (PTT and military) use digital transmission links. NATO
also owns and operates automatically switched voice and telegraph networks.
It is to be noted that a significant portion of the traffic that flows in the
common-user network is related to air operations. As far as UHF/VHF and HF
radios are concerned, they provide analogue voice and data except for
JTIDS/MIDS, which is totally digital and is currently available for the NATO
AEW programme. The NATO communications systems carry some circuits which are
cryptographically secure end-to-end, and there are some links and circuits
carried by SATCOM and JTIDS which are protected also against jamming.

2.1 Integrated Services Digital Network (ISDN)

NATO decided in 1984 that most of the NATO terrestrial communications
requirements would be met in the future by the strategic military
communications networks that are today being designed, and some being
implemented, by the Member Countries. All these networks largely follow the
CCITT IDN/ISDN standards and recommendations and adopt the International
Standards Organisation's (ISO) Open System Interconnection Reference Model
(OSI/RM). These digital common-user grid networks provide mission-related,
situation-oriented, low-delay "teleservices" such as plain/secure voice,
facsimile and non-interactive and interactive data communications. These are
enhanced by "supplementary services" such as priority and pre-emption,
secure/non-secure line warning as well as closed-user groups, call forwarding
and others. The switching subsystem supports three types of connection
methodology, namely semi-switched connections, circuit-switched connections,
and packet/message switched connections. The circuit switching technique used
is byte-oriented, synchronous, time-division-multiplexed (TDM) switching
in accordance with CCITT standards. The basic channels are connected through
the network as transparent and isochronous circuits of 64 kb/s or n x 64 kb/s
where n is typically 32. Possible uses of the 64 kb/s unrestricted circuits
are shown in Fig. 4.

The basic channel structure used in ISDN has T and S reference points and
consists of two B channels at 64 kb/s and one D channel at 16 kb/s. One or
both of the B channels may not be supported beyond the interface. The B
channel is a pure digital facility (that is, it can be used as a
circuit-switched, packet-switched, or non-switched/nailed facility),
while the D channel can be used for signalling, telemetry, and packet-switched
data. The basic access allows the alternate or simultaneous use of a number
of terminals. These terminals could deal with different services and could be
of different types.

The primary rate B-channel structure is composed of 23 B or 30 B channels
(depending on the national digital hierarchy primary rate, that is, 1544 or
2048 kb/s) and one D channel at 64 kb/s. PABX connection to the T reference
point can use (depending on its size) multiple basic channel structure
accesses, a primary rate B-channel structure, or one or more primary rate
transmission systems with a common D channel. The primary rate H-channel
interface structures are composed of H0 channels (384 kb/s) with or without a
D channel, or an H11 channel (1536 kb/s). H channels can be used for
high-fidelity sound, high-speed facsimile, high-speed data, and video.
Primary rate mixed H0 and B-channel structures are also possible. Subrate
channel structures are composed of channels of less than 64 kb/s, which are
rate adapted and multiplexed into B channels.
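
The channel structures described above reduce to simple arithmetic on 64 kb/s and 16 kb/s building blocks. The following Python sketch is illustrative only: the function name `interface_payload_kbps` and the dictionary layout are assumptions made for this example and are not taken from any CCITT recommendation. It tallies the B- and D-channel payload of the basic and primary rate structures and the part of each primary line rate left over after the B and D channels.

```python
# Channel rates quoted in the text, in kb/s.
B_CHANNEL = 64
H0_CHANNEL = 384
H11_CHANNEL = 1536


def interface_payload_kbps(n_b: int, d_rate: int) -> int:
    """Payload of an interface carrying n_b B channels plus one D channel."""
    return n_b * B_CHANNEL + d_rate


# Hypothetical labels; the channel counts and D-channel rates are those in the text.
structures = {
    "basic (2B+D)": interface_payload_kbps(2, 16),
    "primary 1544 (23B+D)": interface_payload_kbps(23, 64),
    "primary 2048 (30B+D)": interface_payload_kbps(30, 64),
}

for name, rate in structures.items():
    print(f"{name}: {rate} kb/s")

# Part of each primary line rate not occupied by B and D channels.
remainder_1544 = 1544 - structures["primary 1544 (23B+D)"]
remainder_2048 = 2048 - structures["primary 2048 (30B+D)"]

# H channels expressed as multiples of the 64 kb/s B channel.
h0_in_b = H0_CHANNEL // B_CHANNEL
h11_in_b = H11_CHANNEL // B_CHANNEL
```

In this accounting an H0 channel occupies the capacity of six B channels and an H11 channel that of twenty-four, which is why mixed H0 and B structures fit within the same primary rate.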

Future evolution of the ISDN will likely include the switching of
broadband services at bit rates greater than 64 kb/s, at the primary rate,
as well as switching at bit rates lower than 64 kb/s, which are made possible
by the end-to-end digital connectivity. Table I shows some typical service
requirements for civil and also for military applications.


Table I. Some Service Requirements

Service                            Bit rate

Interactive Data Communications                                   X   X
Electronic Mail                    4.8-64 kb/s                    X   X
Bulk Data Transfer                 4.8-64 kb/s                    X   X
Facsimile/Graphics                 4.8-64 kb/s                    X   X
Slow Scan/Freeze Frame TV          56-64 kb/s                     X   X
Compressed Video Conference        1.5-2 Mb/s (primary rate)      X   X

In the ISDN environment, the use of common channel signalling networks
significantly reduces the call setup and disconnect times; use of Digital
Speech Interpolation (DSI) can enhance the transmission efficiency on a
cost-driven basis.
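
The efficiency gain from Digital Speech Interpolation can be illustrated with a simple probabilistic model. The sketch below assumes, purely for illustration, that each talker is active independently with a fixed probability; the 0.4 activity figure, the 48-on-24 loading and the function name are assumed values and names, not figures from this paper. It computes the probability that more talkers are active at one instant than there are channels to carry them.

```python
from math import comb


def overload_probability(n_talkers: int, n_channels: int, activity: float) -> float:
    """Probability that more than n_channels of n_talkers are active at once,
    assuming independent talkers (binomial model)."""
    return sum(
        comb(n_talkers, k) * activity**k * (1 - activity) ** (n_talkers - k)
        for k in range(n_channels + 1, n_talkers + 1)
    )


# Assumed example: 48 conversations interpolated onto 24 channels,
# each talker active 40% of the time.
p = overload_probability(48, 24, 0.4)
print(f"P(active talkers > channels) = {p:.4f}")
```

A small overload probability suggests that the number of conversations can exceed the number of channels with only occasional clipping, which is the sense in which interpolation raises transmission efficiency.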

Packet switching (5,6), which allocates bandwidth on a dynamic basis, has
become the preferred technique for data communications. In addition to
utilising the bandwidth more efficiently, packet switching permits protocol
conversion and error control, and achieves the fast response times needed for
interactive data communications.

Looking ahead into the future, both for military and civil applications,
we see good prospects for the integration of voice and data traffic.
Investigation of different techniques permitting integration of voice and
data traffic in one network has been a subject of ongoing research for more
than a decade. These techniques include hybrid switching (7), burst switching
(8), and packet switching for speech and data (9). A common objective of all
these techniques is to improve efficiency of speech connections in comparison
with the circuit-switched network, with minimal degradation to speech quality
as a result of clipping and message delay.

Hybrid  switching  can  achieve  acceptable  voice  message  delays.  However, 
lower  transmission  efficiency  and  higher  complexity  than  packet-switching 
concepts  render  it  unattractive  for  application  in  public  switched  networks. 

Burst switching achieves high transmission efficiencies and low voice
message delays. It is an attractive concept, but high costs associated with
the development of a new family of switching systems and the lack of
evolutionary migration paths for implementation make it unsuitable for public
networks.

The attraction of speech packet communications (9) lies in the relative
simplicity of packet-switching concepts, and the fact that computer systems
for data packet switching can be adapted for speech packet communications.
While existing protocols for packet data communications such as X.25 are not
suitable for achieving the small fixed delays necessary in speech packet
communications, significant progress has been made in developing new
protocols under the sponsorship of the Defence Advanced Research Projects
Agency (DARPA) (10,11) and the Defence Communication Agency (DCA). While still
in a developmental stage, speech packetisation increasingly appears to be the
prime contender for future voice/data integration in common-user networks.
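
One reason small, fixed delays are hard to achieve is that a speech packet cannot be sent until the coder has produced enough bits to fill it. The sketch below assumes a constant-rate coder and a fixed payload size and computes this packetisation delay for a few coder rates; the function name and the 64-byte payload are illustrative choices, not values from the paper.

```python
def packetisation_delay_ms(payload_bytes: int, coder_rate_bps: float) -> float:
    """Time in milliseconds a constant-rate speech coder needs to fill one payload."""
    return payload_bytes * 8 / coder_rate_bps * 1000.0


PAYLOAD_BYTES = 64  # assumed payload size, for illustration only

for rate in (64_000, 16_000, 2_400):
    delay = packetisation_delay_ms(PAYLOAD_BYTES, rate)
    print(f"{rate:>6} b/s coder: {delay:.1f} ms to fill {PAYLOAD_BYTES} bytes")
```

Under these assumptions the fill time grows in inverse proportion to the coder rate, which suggests that low-bit-rate speech packets would need short payloads to keep the end-to-end delay small.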

Another speculative impetus for speech packet communications lies in the
potential for voice recognition and direct speech input to program, command,
and control the operation of artificial intelligence machines. Speech packet
communications are ideally suited for such applications.

3.  OPERATIONAL  REQUIREMENTS 

The requirements for air warfare are subsumed in the total requirement
for the switched networks. The network must be dimensioned to meet the needs
of non-mobile military traffic securely, reliably, survivably, and with no
operationally significant delays, so as to preserve the radio-frequency
spectrum for mobile and broadcast applications, including possibly the
restoration or reconfiguration of the static network and/or rerouting of
traffic following battle damage. The survivability of the communications must
(at least) match that of the war headquarters and weapon sites which it
integrates and serves. Operational procedures must be developed to maintain
essential operation, even when the capacity of this network has been
seriously reduced by battle damage. Survivability of connectivity is however
of paramount importance.

The satellite network must similarly be dimensioned to meet the joint
requirements of its total user community, which comprises primarily those
difficult to access otherwise because of:

a) long range (and relatively large data-rate) requirements,

b) mobility,

c) multi-access requirements.

Its security and ECM resistance must be assured and its potential
any-to-any and any-to-all capability must be made available for flexible
exploitation by the user.

Security and ECM resistance are equally required for the various tail
links.

Air-ground and ground-air links for close fighter control cannot tolerate
delays of more than a fraction of a second when they are part of a
close-control loop. The true data-rate, in information-theory terms, is not
more than 100 bits. It is essential that the interface to the pilot be
user-friendly, and this should normally include (possibly synthesised) spoken
messages, in order to keep the pilot's eyes free for his primary duties.
Immunity to even short-term disruption by ECM is essential. The air-ground
capacity required is marginally smaller than that for ground-air.

Broadcast Control can accept slightly longer delays, but it involves a
more varied type of data and may involve a somewhat larger total data rate;
it may also require more air-ground traffic. The needs for communications with
close-support strike aircraft, in a confused and rapidly changing battle
situation, are similar to those for fighters, but with increased flexibility
and capacity in the air-ground direction.

Long-range deep-penetration missions must be accessible to relatively few
and short re-targeting and recall messages. In principle, the data rates need
not exceed a few bits per second, and delays of possibly several minutes could
be tolerated if necessary. In the reply direction, acknowledgements and
reports of survival or otherwise, and of success or failure of a strike
mission, are equally undemanding in terms of communications capacity. Any
reconnaissance reports from long range could also tolerate a delay of a few
minutes if necessary, but even with data reduction, reconnaissance reports
(from any range) can benefit from the widest bandwidth which can be provided
with the technology available. For long-range missions, low probability of
intercept would also be highly desirable.

If the technology dictates a sharp division in capability and/or solution
between operations

a) within line-of-sight from the ground behind the FEBA,

b) within line-of-sight from the air behind the FEBA,

c) beyond line-of-sight from the FEBA,

good, but distinct, solutions to these three scenarios can be accepted.

The operational requirements outlined above do certainly imply, in
addition to graphic and data communications, the use of voice.
Intelligibility is the most important parameter, with "speaker recognition"
aiding "authentication" being also required, although its value in a
multinational environment may be questioned.

3.1 Effects of Propagation and Electronic Warfare

We must now turn our attention to the restrictions that propagation
conditions and jamming impose on the capacity of HF and satellite channels
that are to be used to support long-distance communications to and from the
mobile platforms.

HF is in use as a primary means of communication between aircraft and the
ground over distances beyond the line-of-sight (LOS), and for naval
communications: ship-ship, ship-shore and ship-air. Its principal advantage is
that it provides connectivity at low cost, so that it will continue to be
used in a variety of roles:

- on large aircraft (e.g. bombers, AEW) as back-up to SATCOM, to increase
the cost to the enemy of ECM and to provide medium redundancy

- on small aircraft (e.g. fighters, helicopters) which are not provided
with SATCOM

- for a wide variety of naval communications.

Present systems are perceived to have a number of weaknesses in addition
to the inherent dispersive nature of the channel itself. However, techniques
have been proposed which could alleviate or eliminate these weaknesses. It
is believed that, provided such techniques are employed, HF will continue to
provide connectivity at low cost even in the more difficult jamming
environment to be expected in the future. It must be recognised, however, that
high bit rates are not considered to be achievable: what is offered is a bit
rate in the order of 2.4 kbits/s under favourable conditions, degrading to
about 100 bits/s under severe jamming conditions. It must also be recognised
that 41 . HI availability is not achievable, since even if the effects of
interference etc. (which provide the principal limitation of present systems)
are overcome, there are residual effects such as disturbance of the medium by
various natural causes, and possible nuclear explosions in the atmosphere,
which will make it extremely difficult, if not impossible, to increase
availability above say III.


Satellite communications, to be used both for the switched networks as
well as for mobile users, are expected to consist of multiple satellites
operating both in the 8/7 GHz SHF band and also in the 44, 20/30 GHz EHF band.

In addition to geosynchronous equatorial orbits, inclined non-circular
(Molniya) orbits are expected to be utilised to provide SATCOM coverage
extending up to the polar regions.

These satellites will use multi-beam receive antennas with nulling
capability, multi-beam transmit antennas and processing transponders as a
measure for countering the ECM threat.

Some of these satellites will be owned and operated by NATO while others
will be owned and operated by various NATO nations. The SATCOM capacity
offered by these NATO and national assets will be exchanged under various
Memoranda of Understanding (similar to current practice) to increase the
survivability of NATO and national military common-user networks; this may
require interoperability between NATO and national systems.

The main advantages of EHF SATCOMs for communication using small
terminals on mobile platforms, as compared to the presently used SHF and UHF
SATCOMs, are increased anti-jam (AJ) capability, improvement in covertness of
communications and increased immunity to the disturbing effects caused in the
propagation path by high altitude nuclear detonations.

One geostationary satellite situated over the east Atlantic can provide
sufficient coverage for communication among terminals within the NATO ACE
(and also Atlantic) region. To provide coverage at latitudes above
approximately ll (especially if communications from the polar regions are
required), a constellation of satellites utilising inclined non-circular
orbits will be required. Inter-satellite links may be used to provide
connectivity between users accessing different satellites.

The EHF satellites serving the airborne users are expected to use the 44
GHz uplink and 20 GHz downlink frequency bands, with the satellite bandwidth
available in the uplink and the downlink directions being 3 GHz and 1 GHz
respectively. Frequency hopping is expected to be used as the spread-spectrum
AJ modulation technique so as to fully exploit the available transmission
bandwidths (and also to minimise the disturbances from high altitude nuclear
explosions).

On-board processing involving dehopping/rehopping or
dehopping/demodulation/remodulation/rehopping techniques is expected to be
utilised in these satellites. Such a processing transponder will provide AJ
performance improvement superior to that which can be provided by a
conventional non-processing transponder. Furthermore, such a processing
transponder will transform the available 3 GHz uplink bandwidth into a 1 GHz
downlink bandwidth, hence permitting the full utilisation of the wider
spreading bandwidth available in the uplink direction.

It is assumed that these satellites will use multi-beam receive antennas
with adaptive spatial nulling capability and multiple spot-beam transmit
antennas for increased jamming resistance.

The critical direction in a SATCOM link to an aircraft will be
transmission from the aircraft in the home base direction, because the
aircraft SATCOM terminal has a small transmit EIRP.

It can be shown that an aircraft having an EHF SATCOM terminal with at)
dBW EIRP can support a data rate of approximately nut) bps under a postulated
maximum level of uplink jamming of say IBM dBW EIRP. This traffic capacity
assumes the use of a processing satellite with 3 GHz spread spectrum (hopping)
bandwidth and a nulling satellite receive antenna with Ik dB nulling in the
jammer direction. The method of calculating the jammed traffic capacity is
given in Annex 1. It should be noted that the calculated traffic capacity is
not a function of the type of orbit used by the satellite.

Downlink jamming of the aircraft receiving terminal is considered a
lesser threat since, given the use of spread spectrum techniques and highly
directional receiving antennas with low sidelobes, the jammer would have to be
within line-of-sight of the aircraft and would have to use a directional
antenna, and this would need to be repeated for each aircraft.

As can be seen in Fig 1, relay aircraft may be used to provide
unrestricted communications to aircraft or missiles up to about 300 km beyond
the FEBA. As for HF and satellite links, relayed links to the aircraft would
also be vulnerable to ground- and air-borne jammers. The achievable maximum
range ratio R (TX to RX distance divided by jammer to RX distance) for a
given data rate and threat level is obtained from:

R^2 = (J/S) (P_s/P_j)

where P_s and P_j are the transmitter and jammer powers and an ultimate
anti-jam margin is given by J/S = (200/B) (1/3). This assumes a spread
bandwidth of 200 MHz, and B is the data rate in Mb/s. This relation is plotted
in Fig t. Under a pessimistic assumption, R = 10:1 in favour of the enemy with
P_s = 100 W and P_j = 100 kW gives a maximum data rate of about 200 bits/s.

3.2  Cockpit  Engineering  (12) 

The basic piloting functions are the following:

-  flying  (control  of  aircraft  manoeuvres) 

-  navigation  (location  and  guidance) 

-  communications  (voice  and  data  link) 

-  utilities  management 

-  mission  management 

Decisions to be made by the pilot related to the above tasks are
crucially influenced by how information is obtained and displayed and how
communications are processed and handled. There is also the problem of
language between the machine and the man. Even when the machine is as learned
as the pilot, it will not always know what part of its knowledge is to be
transmitted to the pilot or how to optimally transmit it.

It is generally accepted that in many current military aircraft,
particularly single-crew aircraft, pilot workload is excessive and can be a
limit to the capability of the aircraft as an operational weapon system.
Advances in on-board avionics systems have the potential for generating more
information, and considerable care will be required in optimising the
man-system interface in order that the human pilot capability (which will be
essentially unchanged) is not a major constraint on overall system
performance.

Ideally the man and aircraft systems would together be designed as a
total system. This concept is constrained by some special features of the man
which include his performance variability (from man-to-man and from
day-to-day), the methods required to load information into him, and his
particular input/output channels. At the present time the man has some
important capabilities which, in the short term, are unlikely to be attainable
with machines. These include:

-  Complex  pattern  recognition 

-  Assessment and decision-making concerning complex situations

-  Intuitive judgement.

Although computers currently excel in analysis and numerical computation,
their capabilities in the field of artificial intelligence are developing to
the point at which the man's capabilities in complex assessment and
decision-making may be overtaken. The implications for the design of
man-machine interfaces have not yet been explored, and could raise some
important and fundamental new issues.

Another difference between man and machine is in integrity and failure
mechanisms. For the foreseeable future, man is likely to have a unique
capability to combine extremely high integrity with complex high-bandwidth
operation. The integrity implications of artificial intelligence will
certainly require much study. At the same time it must be recognised that the
demands on pilots of modern aircraft are such that accidents happen far too
frequently, and it should be an aim of overall system design to reduce the
frequency of human failure. Better simulation and briefing prior to flying
the aircraft may be an important development, which will arise from new
electronic techniques.

Current advances in control display technology may be projected into the
future and we may predict that, by 2005-2010, we will be able to operationally
field advanced devices within the constraints of tactical aircraft size,
weight, and cost. In this time period digital processing is expected to be
orders of magnitude less costly, in terms of size, weight and dollars, than
current equipment. In addition we may confidently expect that AI languages
and programming aids will make it much easier to generate complex computer
programs that will be able to cost-effectively solve problems that currently
require human intelligence. These advances will result in:


-  Head-up and eyes-out panoramic displays with large fields of view, high
resolution, color, and if desired, enhanced stereo depth cues.

-  Ability  to  synthesize  "real  world"  imagery  and  pictorial  tactical 
situation  displays  that  recreate  clear  day  visual  perception  under  night  and 
adverse  weather  conditions. 

-  High  quality  voice  synthesis  and  robust  voice  recognition. 

-  Natural control of sensors, weapons system, aircraft flight, and
display modes based on head and eye position, finger position, and other
"body language" modalities.

-  Great  simplification  of  tasks  that  require  transfer  of  complex  tactical 
situation  information  from  the  system  to  the  aircrew,  and  rapid  application 
of  the  aircrew's  superior  cognitive  powers  to  management  of  the  weapon 
system. 

Around the year 2000 aircraft displays are not only windows on the status
of flight but are vital in the decision making process during certain stages
of the mission. Especially during those phases with a high pilot workload,
the mission and aircraft data must be formatted and displayed in such a way
that the quality and rate of information to be extracted by the pilot is
sufficient to arrive at major complex decisions within 2 to 5 seconds without
exceeding the pilot's peak workload capacity. In some cases this also implies
that the pilot must delegate some of the lower priority (but still important)
decisions to an automated device without risk of conflict. The displays
should also enable him to evaluate such risks.

3.3 The Use of Voice Systems in the Cockpit

Visual signals are spatially confined: one needs to direct the
field-of-view. Moreover, in high workload phases of the mission, attention
can be focussed on some types of visual information such that other
information which suddenly becomes important can be "overlooked". Also the
amount of information may saturate the visual channel capacity. Aural signals
have the advantage of being absorbed independently of visual engagement,
while man's information acquisition capacity is increased by using the two
channels simultaneously. Motoric skills are hardly affected by speaking. For
information being sent from aircraft to other humans on the ground and in the
air, speech is a natural and efficient technique which has been used for many
years. Until now the process of aural communication between aircrew and
systems has usually been restricted to a limited range of warning signals
generated by the systems.

Digital voice synthesis devices are now widely available and have many
commercial applications (see section 4.3). Technically there appear to be no
problems in using them in aircraft to transfer data from aircraft systems to
aircrew, the real difficulty being in identifying the types of message which
are best suited to this technique. Warning messages currently appear to be a
particularly useful application, though these will probably need to be
reinforced by visual warnings as aircrew can totally miss aural warnings
under some conditions. Feedback of simple numerical data is also being
considered.

One of the main disadvantages of aural signals is that the
intelligibility is greatly impaired by noise in the cockpit; this is true
both ways. However, the understanding of the mechanisms of speech synthesis
and speech recognition has reached the point where voice systems in the
cockpit can be considered. Although electronic voice recognition in the
laboratory reaches scores of 96 to 98% (comparable with keyboard inputs), the
vocabulary is still very limited and recognition tends to be personalised.
But the prospect of logic manipulation in AI techniques can greatly improve
the situation, depersonalising recognition in noisy environments. Actual data
on such improvements are difficult to obtain. These would also depend on how
much redundancy is used in both syntax and semantics. Furthermore a coding
"language" is to be preferred, just as in conventional aircraft radio
communication, to prevent the system responding to involuntarily uttered
(emotional) exclamations.

Several commercial voice recognition equipments are currently available
on the open market, but these have not been designed for airborne application
and considerable development will be needed before they can be regarded as
usable equipments for combat aircraft. Simulator and airborne trials in a
number of countries using this early equipment have identified the following
as key areas in which further investigation/improvement is required:

a)  Size of vocabulary. At present this is very limited, but recognition
performance is generally inversely related to vocabulary size.

b)  Background noise/distortion. The cockpit environment is frequently
very poor, and the oxygen mask and microphone are far from ideal.

c)  Necessity for pre-loading voice signatures. Current systems have to be
loaded with individual voice templates. Consequently, if the aircrew voice
changes (e.g., under stress) recognition performance is reduced. Moreover
some subjects have a much greater natural variability in their voices than
others.

d)  Continuous speech recognition. Most early equipments can only
recognise isolated words, whereas in natural speech the speaker frequently
allows one word to flow continuously into the next.

e)  Recognition performance. Even under ideal conditions, recognition
scores are always less than 100%, and under bad conditions and with poor
subjects the scores may be only 50-75%. Thus it is currently necessary (and
probably will continue to be so) to have some form of feedback to confirm to
the speaker that the message has been correctly "captured".

In summary, trials with first generation voice recognition equipments
have produced encouraging results, but the need for significant improvements
has been identified and these are now being explored. It may be too early to
give an exact estimate of the extent to which voice recognition techniques
will be used in future combat aircraft, but there is considerable promise
that a valuable new interface channel can be developed. First applications
are likely to be in areas where 100% accuracy in data transmission is not
essential and where an alternative form of data input is also available to
aircrew.

3.4 Summary of Requirements

Efficient use of transmission media and restrictions imposed on the radio
channel capacity by propagation factors and by jamming force the channel
bit rate to be kept as low as possible. Under no-stress conditions rates
ranging from n x 64 kb/s to 2.4 kb/s are required, while under heavy jamming
the supportable information rate can go down to 600-200 bit/s. Under all
these conditions voice is perceived as a preferred method of man-to-man
communications and this requires speech coding from 64 kb/s down to a few
hundred bits/s. Sophisticated speech coding methods, including variable rate
encoding together with Digital Circuit Multiplication (DCM), are also invoked
to overcome, in the short-to-medium term, the areas of economic weakness of 64
kb/s PCM, namely satellite and long-haul terrestrial links used in switched
networks prior to widespread availability of optical fibre links. As far as
intra-aircraft communications are concerned, machine-to-man (speech synthesis)
and man-to-machine (speech recognition) voice communications are considered
very necessary because this leaves the hands and the eyes free to perform
other functions in the cockpit.

It is to be noted that the military always try to use, to the maximum
extent possible, the civil networks which benefit from the economies
of scale. There are, however, requirements such as survivability, security,
mobility and precedence/pre-emption that are regarded as vital by the
military but not considered important for civil applications. Experience
shows, however, that the service features required by the military in time
become requirements also for the civilian systems. This is certainly true as
far as the following network features and trends (13) are concerned:

i)  Voice coding at sub-rates of 64 kb/s and as low as 16 kb/s and even
lower, for long connections and mobile applications.

ii)  Speech synthesis and speech recognition using sub-rates of 64 kb/s, for
instance for voice message services and recorded announcements.

iii)  Digital Circuit Multiplication (DCM) for making more efficient use of
the transmission media.

iv)  Long-term objective of integrating voice, data and imagery in the
evolving broad-band ISDN (14) when the "Asynchronous Transfer Mode"
(ATM) of operation is expected to be implemented using packetised
speech. DCM applications are related to the use of digital links at
speeds on the order of a few Mb/s, while ISDN-ATM applications are
foreseen at much higher link speeds (i.e., 50-160 Mb/s).

There are, however, important differences between the military and civil
applications as far as environmental factors are concerned; acoustic noise,
vibration, acceleration and jamming are some of them. In the lectures that
follow, speech processing will be treated in all its aspects considering both
civil and military requirements and applications.

4. SPEECH PROCESSING

Having established the fact that the spoken word plays and will continue
to play a significant role in man-man, man-machine and machine-man
communications for air operations and other applications, both real-time and
with intermediate storage (e.g., for "voice mail"), a brief look will now be
taken at the developments in speech processing that contribute significantly
in all these areas.

The problem of speech compression and composition, i.e., Speech
Processing, arose out of a study of the human voice; for example, Alexander
Graham Bell and his father, and later Sir Richard Paget (16) and others, had
studied speech production and the operation of the ear (15). In 1939 Homer
Dudley (17,18) demonstrated the Voder at the New York World's Fair. This
instrument produced artificial voice sounds, controlled by the pressing of
keys, and could be made to "speak" when controlled manually by a trained
operator. In 1936 Dudley had demonstrated the more important Vocoder: this
apparatus gave essentially a means for automatically analysing speech and
reconstructing or synthesising it. The British Post Office also started, at
about this date, on an independent program of development, largely due to
Halsey and Swaffield (19). Despite the marginal quality, the vocoder was used
on high-frequency radio on, for instance, transatlantic routes, to provide
full digital security. The inauguration in 1956 of the first transatlantic
undersea cable providing 36 voice circuits ($1 million per channel) encouraged
work on bandwidth conservation. This led to the deployment in 1959 of a speech
processing technique known as TASI (Time Assignment Speech Interpolation)
which doubled the capacity of the cable by taking advantage of limited voice
activity during a call: only the active parts of a conversation (talkspurts)
are transmitted.
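
As a rough illustration of why speech interpolation can double capacity, the
sketch below estimates how often more talkers are active than there are
channels when 72 one-way speech paths share 36 circuits. The independent
on/off talker model and the 40% activity factor are assumptions made for this
example only; they are not figures from the lecture.

```python
from math import comb


def prob_more_active_than(talkers: int, channels: int, activity: float) -> float:
    """Probability that more than `channels` talkers are in a talkspurt at
    the same instant, each talker being active independently with
    probability `activity` (an assumed, simplified model)."""
    return sum(
        comb(talkers, k) * activity**k * (1.0 - activity) ** (talkers - k)
        for k in range(channels + 1, talkers + 1)
    )


# 36 circuits carrying 72 talkers (capacity doubled, as stated in the text).
overload = prob_more_active_than(talkers=72, channels=36, activity=0.4)
print(f"Fraction of time demand exceeds 36 channels: {overload:.3f}")
```

Under these assumptions the demand exceeds the available circuits only for a
small fraction of the time, which is when the start of a talkspurt would be
clipped.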

Efforts in speech processing continued, driven by the requirement to use
transmission capacity efficiently, till about 1970 when making computers more
useful for humans emerged as the trend, spurred by the advances and
proliferation of digital computers. This interest centred on the use of
human voice for man-computer interaction. Speech synthesis concerns
machine-to-man communication (talking machine) and speech recognition allows
machines to listen to and "understand" human speech. Most of the technology
for reducing speech bandwidth applies to speech synthesis and recognition, but
the objective of achieving transmission efficiency still remains as the main
motivation for speech processing work despite the promise of very large
bandwidth from optical fibres.

Fig 7 shows roughly the relationship between speech transmission and
recognition and synthesis of speech (20). In each case processing starts with
"preprocessing" which extracts some important signal characteristics. The
following stage is still a preprocessing stage but extracts more
complicated and combinatorial parameters, such as segmented phoneme parameters
or prosodic parameters like speech intonation, which are necessary for a
speech recognition system. The succeeding stages are concerned with the
central issue of recognition and understanding. A speech output is then
produced based on linguistic rules. The phonetic and speech synthesis parts
again handle the higher and lower level parameters to produce a speech signal
which, when applied to a loudspeaker/earpiece, is converted into an acoustic
signal. In a speech transmission system with redundancy reduction
(compression), the inner part of Fig 7 is by-passed and a parametric
description of the analysed signal is sent directly to a synthesiser which
can reproduce the speech signal.

4.1 Speech Coding

Speech compression systems can generally be classified as either Waveform
coders or Vocoders (i.e., voice coders or analysis-synthesis telephony).
These two classes cover the whole range of compressibility from 64000 down to
a few hundred bits per second. The important factors which need to be taken
into account when comparing different encoding techniques are the speech
quality achievable in the presence of both transmission errors and acoustic
noise, the data rate required for transmission, the delay introduced by
processing, the physical size of the equipment and the cost of implementation
(a function of coder complexity which can be measured by the number of
multiply-add operations required to code speech, usually expressed in
millions of instructions per second, "MIPS").

The most basic type of waveform coding is pulse code modulation (PCM),
consisting of sampling (usually at 8 kHz), quantising to a finite number of
levels, and binary encoding. The quantiser can have either uniform or
non-uniform steps, giving rise to linear and logarithmic PCM respectively.
Log-PCM has a much wider dynamic range than linear PCM for a given number of
bits per sample, because low amplitude signals are better represented, and as
a result logarithmic quantisation is nearly always used in wideband speech
communications applications. A data rate of 56 to 64 kbit/s is required for
commercial quality speech and lower rates for military tactical quality.
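
The advantage of logarithmic quantisation for low amplitude signals can be
checked numerically. The sketch below is only illustrative: it assumes
continuous mu-law companding with mu = 255 followed by a uniform 8-bit
quantiser (the text does not name a particular companding law), and compares
the signal-to-noise ratio with that of a plain uniform 8-bit quantiser for a
weak sine wave. All function names are hypothetical.

```python
import math

MU = 255.0  # assumed companding constant (mu-law)


def compress(x):
    """Logarithmic (mu-law) compression of a sample in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def quantise_uniform(x, bits):
    """Uniform mid-rise quantiser covering [-1, 1] with 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = min(levels - 1, max(0, int((x + 1.0) // step)))
    return -1.0 + (index + 0.5) * step


def snr_db(signal, decoded):
    """Signal-to-quantisation-noise ratio in dB."""
    noise = sum((s - d) ** 2 for s, d in zip(signal, decoded))
    power = sum(s * s for s in signal)
    return 10.0 * math.log10(power / noise)


def compare(amplitude, bits=8, fs=8000, freq=440.0, n=800):
    """Return (linear PCM SNR, log PCM SNR) for a sine of given amplitude."""
    signal = [amplitude * math.sin(2 * math.pi * freq * i / fs) for i in range(n)]
    linear = [quantise_uniform(s, bits) for s in signal]
    log_pcm = [expand(quantise_uniform(compress(s), bits)) for s in signal]
    return snr_db(signal, linear), snr_db(signal, log_pcm)


linear_snr, log_snr = compare(amplitude=0.01)  # 40 dB below full scale
print(f"linear PCM: {linear_snr:.1f} dB, log PCM: {log_snr:.1f} dB")
```

At 8 kHz sampling and 8 bits per sample both quantisers use 64 kbit/s; the
difference lies only in how the levels are distributed over the amplitude
range.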

There are many variations on the basic PCM idea, the most common being
differential encoding and adaptive quantisation. Each variation has the
object of reducing the data rate required for a given speech quality, a
saving of approximately 1 bit per sample (8 kbit/s) being achieved when each
is optimally employed. In differential PCM (DPCM) the sampled speech signal
is compared with a locally decoded version of the previous sample prior to
quantisation so that the transmitted signal is the quantised difference
between samples. In adaptive PCM (APCM) the quantiser gain is adjusted to the
prevailing signal amplitude, either on a short term basis or syllabically.
By controlling the adaption logic from the quantiser output, the quantiser
gain can be recovered at the receiver without the need for additional
information to be transmitted. Adaptive differential PCM (ADPCM) is a
combination of DPCM and APCM which saves 2 to 4 bits per sample compared with
PCM, thus giving 48 to 32 kb/s with high quality speech.
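
The point that the quantiser gain can be recovered at the receiver without
side information is easiest to see in code. The following is a minimal
sketch, not the standardised 32 kb/s algorithm: it assumes a fixed
first-order predictor and a step size that is multiplied by a factor selected
by the magnitude of the previous codeword. The predictor coefficient, the
multiplier table and the step limits are illustrative values chosen for this
example.

```python
import math

SAMPLE_RATE_HZ = 8000
BITS = 4  # 1 sign bit + 3 magnitude bits: 8000 * 4 = 32 kbit/s
PREDICTOR = 0.9  # fixed first-order predictor coefficient (assumed)
MULTIPLIERS = (0.9, 0.9, 0.9, 0.9, 1.2, 1.6, 2.0, 2.4)  # illustrative values
STEP_MIN, STEP_MAX, STEP_START = 1e-4, 0.5, 0.01


def _decode_step(code, recon, step):
    """Shared by encoder and decoder: rebuild one sample, then adapt the
    step size using only the transmitted codeword."""
    sign = -1.0 if code & 8 else 1.0
    magnitude = code & 7
    recon = PREDICTOR * recon + sign * (magnitude + 0.5) * step
    step = min(STEP_MAX, max(STEP_MIN, step * MULTIPLIERS[magnitude]))
    return recon, step


def encode(samples):
    """Quantise the difference between each sample and its prediction."""
    recon, step, codes = 0.0, STEP_START, []
    for x in samples:
        diff = x - PREDICTOR * recon
        magnitude = min(7, int(abs(diff) / step))
        code = magnitude | (8 if diff < 0 else 0)
        codes.append(code)
        recon, step = _decode_step(code, recon, step)  # local decoder
    return codes


def decode(codes):
    """Reconstruct the signal from the 4-bit codewords alone."""
    recon, step, out = 0.0, STEP_START, []
    for code in codes:
        recon, step = _decode_step(code, recon, step)
        out.append(recon)
    return out


signal = [0.5 * math.sin(2 * math.pi * 300 * n / SAMPLE_RATE_HZ) for n in range(800)]
decoded = decode(encode(signal))
noise = sum((s - d) ** 2 for s, d in zip(signal, decoded))
snr_db = 10.0 * math.log10(sum(s * s for s in signal) / noise)
print(f"bit rate: {SAMPLE_RATE_HZ * BITS} bit/s, SNR: {snr_db:.1f} dB")
```

Because the encoder contains a copy of the decoder and both start from the
same state, the step size follows the same trajectory at both ends as long
as no transmission errors occur.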

It is interesting to note that although the principle of DPCM has been
known for 30 years, it was not possible to standardise such a 32 kb/s coder
until 1983 (21), after efficient and robust algorithms became available.
These adaptive algorithms are efficient in the sense that they adapt
quantisation and prediction synchronously at the encoder and decoder without
transmitting explicit adaptation information. They are robust in the sense
that they function reasonably well even in a moderate bit-error environment.

There is another adaptive approach to producing a high quality, lower
bit-rate coder, called "adaptive subband coding", which divides the
speech band into four or more contiguous bands by a bank of filters and codes
each band using APCM. After lowering the sampling rates in each band, an
overall bit rate reduction can be obtained while maintaining speech quality by
reducing the bits/sample in less perceptually important high-frequency bands.
Bands with low energy use small step sizes, producing less quantisation noise
than with less flexible systems. Furthermore, noise from one band does not
affect other frequency bands. Coders operating at 16 kb/s using this
technique have been shown to give high quality but with high complexity (22).
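
The bit-rate bookkeeping for a subband coder is simple arithmetic. The sketch
below assumes, purely for illustration, four equal 1 kHz bands covering
0 to 4 kHz, each resampled at 2 kHz, with more bits given to the lower bands;
the 3/2/2/1 allocation is a hypothetical choice and not a figure from the
lecture.

```python
def subband_bit_rate(bands):
    """Total bit rate for (sample_rate_hz, bits_per_sample) pairs."""
    return sum(rate * bits for rate, bits in bands)


# Lowest band first; the upper bands receive fewer bits per sample.
bands = [(2000, 3), (2000, 2), (2000, 2), (2000, 1)]
print(subband_bit_rate(bands), "bit/s")
```

The total equals that of a full-band coder using 2 bits per sample at 8 kHz,
but the bits are concentrated where they matter most perceptually.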

When the number of quantisation levels in DPCM is reduced to two, delta
modulation (DM) results. The sampling frequency in this case is equal to the
data rate, but it has to be well above the Nyquist frequency to ensure that
the binary quantisation of the difference signal does not produce excessive
quantisation noise. Just as with PCM, there are many variations of DM, and
the right hand side of Fig 8 illustrates some of them. The most important form
of DM used in digital speech communications is syllabically companded DM;
there are a number of closely related versions of this, examples being
continuously variable slope DM (CVSD) and digitally controlled DM (DCDM). The
data rate requirements are a minimum of about 16 kbit/s for military tactical
quality speech and about 48 kbit/s for commercial quality.
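
A one-bit coder with a slowly adapting step can be sketched as follows. This
is an illustration in the spirit of syllabically companded DM, not the CVSD
or DCDM algorithm of any standard: the rule "increase the step after three
identical bits, otherwise let it decay", the leak factor and all constants
are assumptions made for this example.

```python
import math

SAMPLE_RATE_HZ = 16000  # one bit per sample, so the data rate is 16 kbit/s
STEP_MIN, STEP_MAX = 0.002, 0.1  # assumed step-size limits
STEP_ATTACK = 0.01  # added to the step while slope overload is detected
STEP_DECAY = 0.98  # slow ("syllabic") decay otherwise
LEAK = 0.99  # leaky integrator


def _advance(bit, estimate, step, history):
    """Shared by encoder and decoder: adapt the step from the last three
    bits, then move the estimate up or down by one step."""
    history = history[1:] + (bit,)
    if history in ((0, 0, 0), (1, 1, 1)):
        step = min(STEP_MAX, step + STEP_ATTACK)
    else:
        step = max(STEP_MIN, step * STEP_DECAY)
    estimate = LEAK * estimate + (step if bit else -step)
    return estimate, step, history


def dm_encode(samples):
    estimate, step, history, bits = 0.0, STEP_MIN, (0, 1, 0), []
    for x in samples:
        bit = 1 if x >= estimate else 0
        bits.append(bit)
        estimate, step, history = _advance(bit, estimate, step, history)
    return bits


def dm_decode(bits):
    estimate, step, history, out = 0.0, STEP_MIN, (0, 1, 0), []
    for bit in bits:
        estimate, step, history = _advance(bit, estimate, step, history)
        out.append(estimate)
    return out


signal = [0.5 * math.sin(2 * math.pi * 300 * n / SAMPLE_RATE_HZ) for n in range(1600)]
bits = dm_encode(signal)
decoded = dm_decode(bits)
noise = sum((s - d) ** 2 for s, d in zip(signal, decoded))
snr_db = 10.0 * math.log10(sum(s * s for s in signal) / noise)
print(f"{len(bits)} bits for {len(signal)} samples, SNR: {snr_db:.1f} dB")
```

As with the adaptive quantiser in PCM, the step size is derived from the
transmitted bits only, so no side information is needed.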

When operated at data rates of 12 kbit/s and lower, the speech quality
obtained with PCM and DM coders is poor, and consequently they cannot be used
as narrow band devices. However, the principles of operation of wideband
coders are useful in analysis-synthesis telephony once significant redundancy
has been removed from the speech waveform. Examples of this are digital
encoding for the transmission of individual speech parameters and the
relationship between LPC and DPCM indicated in Fig 8.

Analysis-synthesis telephony techniques are based on a model of speech
production. Fig 9 (a) shows a lateral cross-section through the human head,
and illustrates the various organs of speech production. Briefly, these are
the vocal tract running from the vocal chords at the top of the larynx to the
mouth opening at the lips, and the nasal tract branching off the vocal tract
at the velum and running to the nose opening at the nostrils. The glottis
(the space between the vocal chords) and the sub-glottal air pressure from
the lungs together regulate the flow of air into the vocal tract, and the
velum regulates the degree of coupling between the vocal and nasal tracts
(i.e., the nasalisation).

There are two basic types of speech sound which can be produced, namely
voiced and unvoiced sounds. Voiced sounds occur when the vocal chords are
tightened in such a way that the subglottal air pressure forces them to open
and close quasi-periodically, thereby generating "puffs" of air which
acoustically excite the vocal cavities. The pitch of voiced sounds is simply
the frequency at which the vocal chords vibrate. On the other hand, unvoiced
sounds are produced by forced air turbulence at a point of constriction in
the vocal tract, giving rise to a noise-like excitation, or "hiss".


A model of speech production often used for the design of
analysis-synthesis vocoders is shown in Fig 9 (b). In this model, a number of
simplifications have been made, the most important ones being that the
excitation source for both voiced and unvoiced sounds is located at the
glottis, that the excitation waveform is not affected by the shape of the
vocal tract, and that the nasal tract can be incorporated by suitably
modifying the vocal tract. These simplifications lead to differing subjective
effects depending on the type of speech sound and the particular vocoder
being used.

In channel vocoding the speech is analysed by processing through a bank
of parallel band-pass filters, and the speech amplitude in each frequency
band is digitised using PCM techniques. For synthesis, the vocal and nasal
tracts are represented by a set of controlled gain, lossy resonators, and
either pulses or white noise are used to excite them. In pitch-excited
vocoders, the excitation is explicitly derived in the analysis, whereas in
voice-excited vocoders it is derived by non-linear processing of the speech
signal in a few of the low frequency channels combined into one.
Pitch-excited vocoders require data rates in the range from 1200 to 2400
bit/s and yield poor quality speech, whereas voice-excited vocoders will
provide reasonable speech quality at 4800 bit/s and good quality at 9600
bit/s.


A formant vocoder is similar to a channel vocoder, but has the fixed
filters replaced by formant tracking filters. The centre frequencies of these
filters along with the corresponding speech formant amplitudes are the
transmitted parameters. The main problem is in acquiring and maintaining lock
on the relevant spectral peaks during vowel-consonant-vowel transitions, and
also during periods where the formants become ill-defined. The data rate
required for formant vocoders can be as low as 100 bit/s but the speech
quality is poor. The minimum data rate required to achieve good quality
speech is about 1200 bit/s, but to date this result has only been obtained
using semi-automated analysis with manually interpolated and corrected formant
tracks.

The third method of analysis-synthesis telephony to have achieved
importance is linear predictive coding. In this technique the parameters of a
linearised speech production model are estimated using mean-square error
minimisation procedures. The parameters estimated are not acoustic ones as in
channel and formant vocoders, but articulatory ones related to the shape of
the vocal tract. For a given speech quality, a transmission data rate
reduction in comparison with acoustic parameter vocoding should be achieved
because of the lower redundancy present. Just as with channel and formant
vocoders, excitation for the synthesiser has to be derived from a separate
analysis, the usual terminology being pitch-excited or residual-excited,
corresponding to pitch or voice excitation in a channel vocoder. LPC is a
very active area of speech research, and new results appear regularly. At
present data rates as low as 2400 bit/s have been achieved for pitch-excited
LPC with reasonable quality speech, and in the range from 9 kbit/s to 16
kbit/s for residual-excited LPC with good speech quality.
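
One common way to carry out the mean-square error minimisation is the
autocorrelation method solved by the Levinson-Durbin recursion; the lecture
does not say which procedure is meant, so this choice is an assumption. The
sketch below returns the predictor coefficients together with the reflection
coefficients, which are the quantities usually associated with an acoustic
tube model of the vocal tract (their sign convention varies between texts).
The synthetic test signal and all names are hypothetical.

```python
import random


def autocorrelation(frame, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    return [
        sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        for lag in range(max_lag + 1)
    ]


def levinson_durbin(r, order):
    """Solve the normal equations for the predictor
    x_hat[n] = sum_k a[k] * x[n - 1 - k].
    Returns (a, reflection coefficients, residual prediction error)."""
    a, reflection, error = [0.0] * order, [], r[0]
    for i in range(order):
        k = (r[i + 1] - sum(a[j] * r[i - j] for j in range(i))) / error
        previous = a[:]
        a[i] = k
        for j in range(i):
            a[j] = previous[j] - k * previous[i - 1 - j]
        reflection.append(k)
        error *= 1.0 - k * k
    return a, reflection, error


# Synthetic second-order "speech-like" signal with known coefficients.
rng = random.Random(0)
frame = [0.0, 0.0]
for _ in range(4000):
    frame.append(1.3 * frame[-1] - 0.6 * frame[-2] + rng.gauss(0.0, 1.0))

coeffs, reflection, residual = levinson_durbin(autocorrelation(frame, 2), 2)
print("estimated predictor coefficients:", coeffs)
```

The estimated coefficients should lie close to the values 1.3 and -0.6 used
to generate the signal; a practical coder would use an order of about ten
and short, windowed frames.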

The application of vector quantisation (VQ), a fairly new direction in
source coding, has allowed LPC rates to be dramatically reduced to 900 b/s
with very slight reduction in quality, and further compressed to rates as low
as 190 b/s while retaining intelligibility (23,24). This technique consists
of coding each set or vector of the LPC parameters as a group instead of
individually as in scalar quantisation. Vector quantisation can also be used
for waveform coding.
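
The essential operation in vector quantisation is a nearest-neighbour search
in a codebook, after which only the index of the chosen entry is transmitted.
The sketch below is a minimal illustration using squared Euclidean distance;
the tiny two-dimensional codebook is hypothetical, and practical LPC
quantisers use trained codebooks and spectrally motivated distance measures.

```python
import math


def nearest_index(vector, codebook):
    """Index of the codebook entry closest to `vector` (squared error)."""
    def distance(entry):
        return sum((v - c) ** 2 for v, c in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: distance(codebook[i]))


def bits_per_vector(codebook_size):
    """Bits needed to transmit one codebook index."""
    return math.ceil(math.log2(codebook_size))


codebook = [(0.0, 0.0), (1.0, 1.0), (1.0, -1.0), (-1.0, 0.0)]
index = nearest_index((0.9, 0.8), codebook)
print(index, codebook[index], bits_per_vector(len(codebook)), "bits")
```

For comparison, quantising ten parameters separately with 4 bits each costs
40 bits per frame, whereas a single index into a 1024-entry codebook costs 10
bits per frame; these figures are illustrative only.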

A good candidate for coding at 9 kb/s is multipulse linear predictive
coding, in which a suitable number of pulses are supplied as the excitation
sequence for a speech segment, perhaps 10 pulses for a 10-ms segment. The
amplitudes and locations of the pulses are optimised, pulse by pulse, in a
closed-loop search. The bit rate reserved for the excitation information is
more than half the total bit rate of 9 kb/s. This does not leave much for the
linear predictive filter information, but with VQ the coding of the
predictive parameters can be made accurate enough.
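
The pulse-by-pulse search can be sketched as a greedy procedure: at each step
the pulse position and amplitude that most reduce the remaining error between
the target segment and the synthesised segment are chosen, and their
contribution is subtracted. The version below is a simplified illustration
which assumes that the synthesis filter is represented by a short impulse
response with a non-zero first sample, and it omits the perceptual weighting
used in practical coders. All names are hypothetical.

```python
def multipulse_search(target, impulse_response, num_pulses):
    """Greedy closed-loop choice of pulse positions and amplitudes.
    Returns (list of (position, amplitude), remaining error signal)."""
    n = len(target)
    residual = list(target)
    pulses = []
    for _ in range(num_pulses):
        best = None
        for pos in range(n):
            span = min(len(impulse_response), n - pos)
            energy = sum(impulse_response[k] ** 2 for k in range(span))
            corr = sum(residual[pos + k] * impulse_response[k] for k in range(span))
            reduction = corr * corr / energy  # error removed by this pulse
            if best is None or reduction > best[0]:
                best = (reduction, pos, corr / energy)
        _, pos, amplitude = best
        pulses.append((pos, amplitude))
        for k in range(min(len(impulse_response), n - pos)):
            residual[pos + k] -= amplitude * impulse_response[k]
    return pulses, residual


# Target built from two known pulses so that the search can be checked.
h = [1.0, 0.5, 0.25, 0.125]
target = [0.0] * 20
for k, value in enumerate(h):
    target[3 + k] += 2.0 * value
    target[10 + k] -= 1.0 * value

pulses, remaining = multipulse_search(target, h, num_pulses=2)
print(pulses)
```

With 10 pulses per 10-ms segment, as suggested in the text, the same loop
would simply be run ten times over an 80-sample segment at 8 kHz sampling.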

For 4 kb/s coding, code-excited or stochastically excited linear
predictive coding is promising. The coder stores a repertory of candidate
excitations, each a stochastic, or random, sequence of pulses. The best
sequence is selected by a closed-loop search. Vector quantisation in the
linear predictive filter is almost a necessity here to guarantee that enough
bits are available for the excitation and prediction parameters. Vector
quantisation ensures good quality by allowing enough candidates in the
excitation and filter codebooks.
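The closed-loop search over a repertory of candidate excitations can be sketched as below. This is a simplified illustration under stated assumptions: a first-order all-pole synthesis filter, a plain squared-error criterion (practical coders normally apply perceptual weighting, which is omitted here), and a small random codebook. All names and values are hypothetical.

```python
import random


def synthesize(excitation, coeffs):
    """All-pole synthesis: y[n] = x[n] + sum_k coeffs[k] * y[n-1-k]."""
    out = []
    for n, x in enumerate(excitation):
        y = x
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                y += a * out[n - 1 - k]
        out.append(y)
    return out


def search_codebook(target, codebook, coeffs):
    """Try every candidate excitation, synthesise it, and keep the one
    whose optimally scaled output is closest to the target segment.

    Returns (index, gain, squared_error).
    """
    best = (None, 0.0, float("inf"))
    for index, candidate in enumerate(codebook):
        synth = synthesize(candidate, coeffs)
        energy = sum(s * s for s in synth)
        if energy == 0.0:
            continue
        # Least-squares optimal gain for this candidate.
        gain = sum(t * s for t, s in zip(target, synth)) / energy
        error = sum((t - gain * s) ** 2 for t, s in zip(target, synth))
        if error < best[2]:
            best = (index, gain, error)
    return best


rng = random.Random(0)
coeffs = (0.9,)  # illustrative first-order predictor
codebook = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(64)]

# Build a target that is exactly a scaled version of candidate 7,
# so the search has a known correct answer.
target = [2.5 * s for s in synthesize(codebook[7], coeffs)]
index, gain, error = search_codebook(target, codebook, coeffs)
print(index, round(gain, 3))
```

Only the winning index and the gain need to be transmitted for the excitation, which is why the size of the codebook directly sets the excitation bit rate.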
Table II below compares tradeoffs for representative types of speech
coding algorithms (35). It shows the best overall match between complexity,
bit rate and quality. A coder type is not necessarily limited to the bit rate
stated. For example, the medium-complexity adaptive differential pulse-code
modulation coder can be redesigned to give communication-quality speech at 16
kb/s instead of high-quality speech at 32 kb/s. In fact, a highly complex
version can provide high-quality speech at the lower bit rate. Similarly,
lower-complexity multipulse linear predictive coding can yield high-quality
coding at 16 kb/s, and a lower-complexity stochastically excited linear
predictive coder (LPC) can be designed if the bit rate can be 8 kb/s instead
of 4 kb/s.


Cost is also a tradeoff factor, but it is hard to quantify in a table.
The cost of coding hardware generally increases with complexity. However,
advances in signal processor technology tend to decrease cost for a given
level of complexity and, more significantly, to reduce the cost difference
between low-complexity and high-complexity techniques.

Of course, as encoding and decoding algorithms become more complex they
take longer to perform. Complex algorithms introduce delays between the time
the speaker utters a sound and the time a coded version of it enters the
transmission system. These coding delays can be objectionable in two-way
telephone conversations, especially when they are added to delays in the
transmission network and combined with uncancelled echoes. Coding delay is
not a problem if the coder is used in only one stage of coding and decoding,
such as in voice storage. If the delay is objectionable because of
uncancelled echoes, the addition of an echo canceller to the voice coder can
eliminate or mitigate the problem. Finally, coding delay is not a concern if
the speech is merely stored in digital form for later delivery.

Many explanations can be given as to why particular types of speech coder
do not perform well at low data rates. With waveform coders, it is generally
accepted that the main reason is excessive quantisation noise despite
companding and/or adaptive logic. With analysis-synthesis techniques, the
main reasons are over-simplification of the vocal tract model, leading to
imprecise spectral characterisation, and unreliable pitch detection and
voiced-unvoiced-silence decisions in the analyser which, coupled with an
over-simplified excitation model in the synthesiser, lead to imprecise
temporal characterisation and a lack of naturalness in the synthetic speech.

In conclusion on speech coding, it should be remarked that there are two
complementary trends at work in digital telecommunications: speech
coding developers are trying to reduce the bit rate for a given quality level
while developers of modulation and demodulation techniques are endeavouring
to increase the bit rate that a channel of a given bandwidth can accommodate.

The limiting capacity C (b/s) of a channel with a bandwidth B (Hz) and
signal-to-noise ratio SNR (as a power ratio) is given by Shannon's theory of
communication as

    C = B \log_2 (1 + SNR)

A typical analogue telephone channel with B = 3 kHz and SNR = 30 dB would
therefore have C ≈ 30 kb/s. A modulation system with this performance has yet
to be devised, however.
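The telephone-channel figure can be checked numerically with the standard Shannon capacity relation C = B log2(1 + SNR). The sketch below is only a worked example; the function name is hypothetical, and the SNR is converted from decibels to a power ratio before use.

```python
import math


def shannon_capacity(bandwidth_hz, snr_db):
    """Limiting capacity in b/s of a channel with the given bandwidth
    and signal-to-noise ratio (SNR supplied in dB)."""
    snr_linear = 10 ** (snr_db / 10)  # dB -> power ratio
    return bandwidth_hz * math.log2(1 + snr_linear)


capacity = shannon_capacity(3000, 30)
print(f"{capacity / 1000:.1f} kb/s")
```

For B = 3 kHz and SNR = 30 dB the result is just under 30 kb/s, in line with the figure quoted for a typical analogue telephone channel.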

Limiting performance for speech coding may be calculated as follows. In
the English language there are 42 ≈ 2^5.4 distinct sounds called "phonemes"
(26), and normal speech is basically a continuous process of interpolation
between these sounds. A normal talker utters about ten phonemes per second
(27), and the basic information of speech (the information rate of the
written equivalent of the words spoken) is thus only about 5.4 x 10 = 54 b/s.
If one allows for, say, 1560 variations on the basic phonemes to accommodate
different dialects and personal characteristics, then the total number of
sounds is 42 x 1560 ≈ 2^16, and if one allows for a very fast talker
uttering, say, 40 phonemes per second, then the information rate is still
only about 40 x 16 = 640 b/s. There is thus a large discrepancy between the
data rate required for a good quality PCM system and the rate at which real
information is transmitted. The stage of development at present is such that
it may soon be possible to send high-quality digital speech signals at about
8 kb/s over a wide range of channels. Robust, high-quality coding algorithms
will cut the bit rate and new modulators and demodulators will transmit the
lower bit rate, with a low bit-error probability, over an analogue channel
having a bandwidth of about 1 kHz. Analogue voice links, now used for
transmitting high-quality analogue speech, will therefore be able to carry
high-quality digital speech with added benefits such as voice security.
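The arithmetic behind these limiting figures is a direct application of "bits per symbol times symbols per second". The sketch below reproduces it, assuming equally likely sounds (which is what the log2 count implies); the function name is hypothetical.

```python
import math


def information_rate(num_sounds, sounds_per_second):
    """Information rate in b/s if each sound is one of `num_sounds`
    equally likely alternatives."""
    return math.log2(num_sounds) * sounds_per_second


# 42 phonemes at about ten phonemes per second.
basic_rate = information_rate(42, 10)

# 42 x 1560 variants, very fast talker at 40 phonemes per second.
generous_rate = information_rate(42 * 1560, 40)

print(f"bits per phoneme: {math.log2(42):.2f}")
print(f"basic rate:       {basic_rate:.0f} b/s")
print(f"generous rate:    {generous_rate:.0f} b/s")
```

Even the generous estimate is lower than the rate of a conventional PCM system by about two orders of magnitude, which is the discrepancy the paragraph refers to.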

This rather lengthy précis on speech coding, which is given here because
of the central role of the subject in the whole speech processing field, will
be elaborated on and expanded by Prof Gersho in his lecture on "Speech
Coding".

4.3 Speech Synthesis

Speech synthesis involves the conversion of a command sequence or input
text (words or sentences) into a speech waveform using algorithms and
previously coded speech data. The text can be entered by keyboard, optical
character recognition, or from a previously stored data base. Speech
synthesisers can be characterised by the size of the speech units they
concatenate to yield the output speech as well as by the method used to code,
store and synthesise the speech. Large speech units, such as phrases and
sentences, can give high-quality output speech (with large memory
requirements). Efficient coding methods reduce memory needs, but usually
degrade speech quality.

Synthesisers can be divided into two classes: text-to-speech systems
which constructively synthesise speech from text using small speech units and
extensive linguistic processing, and voice response systems which reproduce
speech directly from previously-coded speech, primarily using signal
processing techniques. Voice response systems are often called "speech
coders" and contain both an analyser and a synthesiser.

Synthesisers can also be classified by how they parametrise speech for
storage and synthesis. High quality systems with large memory capacities
synthesise speech by recreating the waveform sample-by-sample in the time
domain. More efficient (but lower quality) systems attempt to recreate the
frequency spectrum of the original speech from a parametric representation. A
third possibility is direct simulation of the vocal tract movements using
data derived from X-ray analysis of human production of specified sound
sequences.

Due to the difficulty of obtaining accurate three-dimensional vocal tract
representations and of modelling the system with a limited set of parameters,
this last method usually yields lower quality speech and has yet to have
commercial application.

The simplest synthesisers concatenate stored words or phrases. This
method yields high-quality speech (depending on the synthesis method) but is
limited by the need to store in computer (read-only) memory all the phrases
to be synthesised after they have been spoken either in isolation or in
carrier sentences. For maximum naturalness in the synthetic speech, each word
or phrase must originally be pronounced with timing and intonation
appropriate for all sentences in which it could be used.

Hybrid synthesisers concatenate intermediate-sized units of stored speech
such as syllables, demisyllables, and diphones, using smoothing of spectral
parameters at the boundaries between units. To further enhance the
flexibility of stored-speech synthesis systems, one can allow control of
prosody (pitch and duration adjustments) during the synthesis process. With
the decreasing cost of digital storage, stored-speech synthesis techniques
could provide low-cost voice output for many applications.
It is clear that stored-speech systems are not flexible enough to convert
unrestricted English (or whatever language) text to speech. A text-to-speech
system that uses synthesis-by-rule is needed for applications such as
accessing electronic mail by voice, a reading machine etc. The text-to-speech
system must convert incoming text, such as electronic mail, that often
includes abbreviations, Roman numerals, dates, times, formulas, and a wide
variety of punctuation marks into some reasonable, standard form. The text
must be further translated into a broad phonetic transcription. How this is
done and other aspects of speech synthesis are explained in the lecture by Dr
Flanagan.
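The first step, reducing incoming text to a standard form, can be illustrated with a very small normaliser. This is a sketch only, assuming a hypothetical three-entry abbreviation table and handling just abbreviations and multi-letter Roman numerals; dates, times, formulas and punctuation would each need their own rules.

```python
import re

# Hypothetical abbreviation table, for illustration only.
ABBREVIATIONS = {"Dr": "Doctor", "Prof": "Professor", "etc": "et cetera"}
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral string such as 'XIV' to an integer."""
    total = 0
    for symbol, following in zip(numeral, numeral[1:] + " "):
        value = ROMAN_VALUES[symbol]
        # A smaller value before a larger one is subtracted (IV = 4).
        if following in ROMAN_VALUES and ROMAN_VALUES[following] > value:
            total -= value
        else:
            total += value
    return total


def normalise(text):
    """Expand known abbreviations and multi-letter Roman numerals."""
    def expand(match):
        word = match.group(0)
        if word in ABBREVIATIONS:
            return ABBREVIATIONS[word]
        # Single letters are left alone so that the pronoun "I" survives.
        if len(word) >= 2 and re.fullmatch(r"[IVXLCDM]+", word):
            return str(roman_to_int(word))
        return word

    return re.sub(r"[A-Za-z]+", expand, text)


print(normalise("Dr Flanagan, Table III"))
```

A rule this crude will also rewrite capitalised words that happen to use only Roman-numeral letters, which is one reason practical systems rely on context and much more extensive linguistic processing.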


There are several commercial text-to-speech conversion systems in the
market which come in board, peripheral, software or system form (28). They
are mostly for English adult male but some do adult female and child voice.
The speech mode used is mostly words with some accepting also letters. The
synthesis technique employed is mostly formant synthesis but some
manufacturers use LPC. Prices vary from a few hundred dollars for software to
a few tens of thousands of dollars for systems. The quality of even the best
systems is such that during tests, listeners understood the synthetic speech
produced 91.7% of the time compared with 99.4% for human speech. Research in
text-to-speech synthesis which concentrates, at present, on producing speech
that sounds more natural, is expected to provide systems which are more
flexible for selecting the speaker characteristics, different languages and
their dialects, and regional variabilities.


4.4  Speech  Recognition 

Of all the speech processing techniques, speech recognition is the most
intractable one. The ultimate objective of most research in this area is to
produce a machine which would understand conversational speech with
unrestricted vocabulary, from essentially any talker. We are far from this
goal.

The reason why automatic speech recognition is such a difficult problem
can be stated very briefly under four problem areas. First, the speech signal
is normally continuous and there are no acoustic markers which identify the
word boundaries. Second, speech signals are highly variable from person to
person and even in one and the same person depending on his state. The third
problem area is ambiguity which is characterised by conditions whereby
patterns which should be different end up looking alike. The fourth problem
area results from the fact that the speech signal is a part of the complex
system of human language where it is often the intention behind a message
that is more important than the message itself. Therefore an advanced speech
recogniser would be expected to incorporate techniques which would enable it
to use the meanings of words in order to interpret what has been said.
However, there are several applications which do not require this full
capability. They range from voice editors and information retrieval from
data bases to basic English and large vocabulary systems required for office
dictation/word processing and language translation.


A technology that is closely related to speech recognition is speaker
recognition, or automatic recognition of a talker from measurements of
individual characteristics in the voice signal. The two tasks that are
relevant here are "absolute identification" and "talker verification", the
former being the more difficult to perform. An interesting military
application of speaker recognition is related to the monitoring of enemy
radio channels with a view to identifying, perhaps in conjunction with
keyword recognition, critical situations before they occur.

The recognition problem has at least three dimensions: vocabulary size,
speaker identity and fluency of input speech. The performance of speech
recognisers also depends on the acoustic environment and transmission
conditions. Current understanding permits building practical systems that
reliably recognise several hundred words spoken by a person who trained the
system. Recognition for any or all speakers requires about ten times more
computation than for individuals whose vocabulary patterns have been stored.
Recognition of single words or short phrases, spoken in isolation, can be
done reliably, even over dialled-up telephone channels. Recognition of
connected words is under active development. Recognition of conversational
fluent speech is in fundamental research, and advances strongly depend on
good computational models for syntax and semantics.

Dr Rabiner reviews and discusses in his lecture the general pattern
recognition framework for machine recognition of speech including some of the
signal processing and statistical pattern recognition aspects. He comments on
the performance of current systems and also on the way ahead in this very
challenging area. He shows that our understanding is best for the simplest
recognition task and is considerably less well developed for large scale
recognition systems.
5. QUALITY EVALUATION METHODS

There are different and as yet generally not standardised methods
(subjective and objective) to measure the "goodness" or "quality" of speech
processing systems in a formal manner. The methods are divided into three
groups:

- Subjective and objective assessment for speech coding and transmission
systems.

- Subjective and objective quality measures for speech output systems
(synthesisers).

- Assessment methods for automatic speech recognition systems.

Dr Steeneken discusses in his lecture assessment methods for these three
groups of speech processing systems. The first two systems require an
evaluation in terms of intelligibility measures while the evaluation of
speech recognisers requires a different approach as the recognition rate
normally depends on recogniser-specific parameters and external factors.
However, more generally applicable evaluation methods such as predictive
methods are also becoming available. For military applications it is, of
course, necessary to include in the test method the effects of the
environmental conditions such as noise level, acceleration, stress, mask
microphones etc. It is emphasised that evaluation techniques are crucial to
the satisfactory deployment of speech processing equipments in real
applications.

Because of its widespread use and to define some important speech
parameters, a subjective assessment method, which is generally used to
measure the perceived speech quality of a speech coder, is outlined below.

The term "quality" is a general term combining many different attributes,
and there are many ways in which these can be assessed. The most important
attributes contributing to speech quality are:

- intelligibility (a measure of "understandableness")

- articulation score (a measure of phoneme recognition)

- speaker identification

5.1 Intelligibility

The most well known technique for measuring intelligibility is the
Harvard Test (29). This test consists of transmitting lists of phonetically
balanced (PB) words through the speech coder under test, and measuring the
proportion of words correctly perceived. The PB word lists consist of
isolated but meaningful words; they are selected in such a way that each
phoneme contained in the list has the same probability of occurrence as it
has in normal conversational speech.

An alternative method of measuring intelligibility is to use meaningful
sentences rather than PB word lists. The percentage of words correctly
perceived then gives a measure of intelligibility. Note that the
intelligibility when sentences are used is higher than that obtained by using
PB word lists because the meaning associated with sentences gives perceptual
clues to the listener and these clues are not available with PB word lists.
When reporting intelligibility scores it is therefore important to specify
which type of test was used and under what conditions it was conducted.


5.2 Articulation Score

The intelligibility tests outlined in the previous section measure the
degree of speech understanding available with a particular speech coder. If
the intelligibility is high, however (e.g. more than 90%), then the tests are
not very sensitive to small differences between different types of coder. A
more sensitive test is to measure the articulation score instead of the
intelligibility. The increase in sensitivity could be achieved by using
logatoms (i.e., nonsense syllables) instead of words (10). The chosen
logatoms could be phonetically balanced for all phonemes or for the
consonants only. The articulation score derived from a consonant recognition
test (CRT) using the latter type of logatom would perhaps give the most
meaningful intelligibility measure for military applications, because the
main clues in perception are derived from consonants rather than vowels. An
illustration of this is the sentence

    -a- -ou u--e---a-- --i-

in which all the consonants have been replaced by a hyphen. It is not very
meaningful. If the opposite condition, in which all the vowels instead of
the consonants are replaced by a hyphen, is now applied to the same sentence
one has

    c-n y-- -nd-rst-nd th-s

which is much more meaningful. The full sentence is of course:

    can you understand this
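The masking used in this illustration is easy to reproduce, which makes it possible to try the same experiment on other sentences. The sketch below assumes that the vowels are the letters a, e, i, o and u and that every other letter (including y) counts as a consonant; the function name is hypothetical.

```python
VOWELS = set("aeiou")


def mask_letters(sentence, hide_vowels):
    """Replace either the vowels or the consonants of `sentence` by hyphens.

    Spaces and other non-letters are left unchanged.
    """
    masked = []
    for ch in sentence:
        is_vowel = ch.lower() in VOWELS
        if ch.isalpha() and is_vowel == hide_vowels:
            masked.append("-")
        else:
            masked.append(ch)
    return "".join(masked)


sentence = "can you understand this"
print(mask_letters(sentence, hide_vowels=False))  # consonants hidden
print(mask_letters(sentence, hide_vowels=True))   # vowels hidden
```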


5.3 Speaker Identification

The ability of a speech coder to transmit the characteristics of a
speaker's voice in such a way that a listener can identify who is speaking is
another attribute contributing to perceived speech quality. In a military
environment, this is an important attribute because of the "need-to-know"
principle.


There is basically only one method for measuring the speaker
identification capability of a speech coder and that is simply to use a
number of different speakers and instruct the listeners to identify which
speaker they think they are hearing. The percentage of correct estimates then
yields a measure of the speaker identification ability of the particular
speech coder under test.

5.4 Quality

The combined effects of the attributes outlined in the previous three
sections (i.e., intelligibility, articulation score, and speaker
identification) can best be measured by conducting "user opinion tests" (31).
Such tests simply consist of instructing a pair of users to discuss a given
problem for a certain period of time via the speech coder under test, and
then asking them to classify their opinion in terms of a five-point scale
given in Table III below. The results obtained from user opinion tests,
averaged over a large number of users, yield an indication of the overall
speech quality of the speech coder under test.
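Averaging the opinions of many users on the five-point scale amounts to a simple mean. The sketch below assumes that scores run from 5 (best) down to 1 (worst) and rejects anything outside that range; the function name and the sample scores are hypothetical.

```python
from statistics import mean

VALID_SCORES = {1, 2, 3, 4, 5}


def average_opinion(scores):
    """Average a list of five-point user opinion scores."""
    if not scores:
        raise ValueError("at least one score is required")
    invalid = [s for s in scores if s not in VALID_SCORES]
    if invalid:
        raise ValueError(f"scores outside the 1-5 scale: {invalid}")
    return mean(scores)


# Illustrative scores from five users of one coder under test.
user_scores = [4, 5, 3, 4, 4]
print(average_opinion(user_scores))
```

Because the scale is ordinal, the averaged figure is best read as an indication of overall opinion, as the text says, rather than as a precise measurement.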

Table III Five-Point Adjectival Scale for Quality and
Impairment and Associated Number Scores

Number score   Quality          Impairment

     5         Excellent        Imperceptible
     4         Good             Perceptible but not annoying
     3         Fair             Slightly annoying
     2         Poor             Annoying
     1         Unsatisfactory   Very annoying


5.5 Measurement


An alternative method for quantifying the "goodness" of a speech coder,
other than assessing the rather ill-defined concept of speech quality, is to
measure its electrical characteristics. Important characteristics which could
be measured include:

- attenuation frequency distortion

- signal-to-noise ratio

- dynamic range

- idle channel noise

- quantisation noise

- susceptibility to transmission errors

- harmonic distortion

- group delay distortion

In order to combine the results of such measurements into a single entity
indicating the "goodness" of the coder, an "articulation index" (AI) could be
computed (31). If so desired, this index might then be directly related to
either an articulation score or an intelligibility assessment. The validity of
such a relationship and the method used for calculating the AI are still
topics of research and development but very promising results have already
been achieved (11).


6. THE SPEECH SIGNAL

In discussing speech processing techniques one must, of course, be fully
aware of and take into account how humans generate the speech signal, how
they perceive it and the process of speech communication itself.

Dr Hunt deals in his lecture with these subjects which underpin all the
other lectures and shows the problem areas with which the researchers in the
speech processing area are faced. He presents speech communication as an
interactive process, in which the listener actively reconstructs the message
from a combination of acoustic cues and prior knowledge, and the speaker
takes the listener's capacities into account in deciding how much acoustic
information to provide.


7. CONCLUSIONS
Speech communication is and will remain in the foreseeable future the
main mode of communication, not only for civil but also for
strategic/tactical military applications. Digital speech processing is,
consequently, an essential ingredient of the evolving ISDNs to be used by
both civil and military users. A fully implemented ISDN is seen as a real
asset to national security and preparedness. End-to-end digital connections
of the kind promised by ISDN are well suited to secure communications;
furthermore, the ubiquity, connectivity, and interoperability inherent in the
concept will be most valuable in emergency situations requiring reconfigured
communications.

Speech coding methods have been standardised internationally at 64 kb/s
(PCM) and 32 kb/s (ADPCM) and coders at these rates are being used in the
common-user switched telephone networks. Continuously Variable Slope Delta
Modulation (CVSD) has also been standardised in NATO for tactical military
communications. There are also both civil and military requirements for
speech coders operating at speeds of 16 kb/s and below, e.g., for mobile land
and maritime communications. For HF communications and LOS radio and
satellite communications under heavy jamming, vocoders operating at 2.4 kb/s
and even below are required. Secure voice using 4 kHz nominal analogue
channels also requires speech coders operating at speeds of 4.8 kb/s and
below. Speech coding is also required for high-fidelity voice (HFV) with 7
and 15 kHz bandwidth as well as for Digital Circuit Multiplication and for
longer-term applications, i.e., in the evolving broadband ISDN when
"Asynchronous Transfer Mode" (ATM) of operation will be implemented.

It is to be noted that there are important operational requirements in
NATO for interoperability between systems using different speech coders; this
necessitates standardisation and agreements on interfaces/gateways where
code, rate and other (signalling, numbering) conversions take place.

To achieve good quality below 32 kb/s, codes must take increasing
advantage of the constraints of speech production and perception. At
transmission rates below 16 kb/s quality diminishes significantly, requiring
more of the, as yet, poorly known properties of speech production and
perception. Also at the lower transmission rates, the computational
complexity to implement the coding algorithms increases, while the ability to
handle nonspeech-like sounds, such as music and voice-band data, diminishes.
Typically too, the encoding delay increases as the transmission bit rate
decreases.

The primary challenge, then, is to develop new understanding that will
significantly elevate the speech-quality curve for the lower bit rates, even
with substantial but acceptable increase in complexity.

The research frontier in coding currently centres on ways to achieve good
quality at transmission rates of 4.8 kb/s and below. Undoubtedly, increased
computational complexity will be required to elevate the quality of low
bit-rate codes, which must extensively use the known redundancies of speech
production and perception. Breakthroughs will occur only when new properties
of redundancy are found (14).

In addition to speech coding, there are evolving Command and Control
requirements for speech synthesis and speech recognition systems on the
ground as well as in the cockpit involving voice storage, voice response,
voice control, speaker authentication/recognition etc. These systems are
expected to find important applications also in the civil networks (34):
telephone answering, remote access, voice mail, speaker verification etc.

In speech synthesis, first systems for unrestricted text-to-speech
conversion are producing useful, intelligible synthetic speech but of limited
naturalness. Over the next five years, work already in progress aims to
produce high-quality synthesis from text, where different voice qualities
(such as man, woman, child) might be specified. Also, synthesis from text
might be realised for languages that are quite different from Western
languages. Over the long term, detailed understanding may permit specifying
individual voice characteristics, dialects, and accents.

In speech recognition, systems for reliable recognition of isolated words
are well established and beginning to prove their value. The near term will
see speaker-independent recognition of connected digits established and
applied.

Over the next few years, the technology is expected to advance to whole
connected sentences, using limited vocabularies and finite grammars. Over the
longer term, understanding of programmed parsers and natural language
analysis will allow the leverage of syntax, semantics and eventually, even
pragmatics to expand a machine's conversational ability. Ultimately,
practical spoken language translation may be possible.




While research and development work, driven by the advances being made in
the areas of microelectronics, computer science, and artificial intelligence,
continues vigorously in many national laboratories on all aspects of speech
processing, international/regional/national standardisation bodies try to
promulgate standards in order to achieve the necessary or desired degree of
uniformity in design or operation to permit voice systems to function
beneficially for both providers and users.

NATO, as a body, is involved in standardisation efforts through its
"Military Agency for Standardisation" (MAS) which has already issued
"Standardisation Agreements" (STANAGs) on 2.4, 4.8 and 16 kb/s coders and
modulation equipment. Also within NATO, the member countries are engaged in
active technical coordination, information exchange and cooperative research
projects through the NATO AC/243 Panel III Research Group (RSG)-10 for speech
processing. The activities of this Group include, among other things, the
application of speech input/output systems in the multilingual military
environment. The countries that participate in the work of RSG-10 are Canada,
France, Germany, Netherlands, United Kingdom, and the United States. In fact,
two of our lecturers are members of this Group.

This lecture series is a state-of-the-art review of speech processing
which is given by scientists who are in the forefront of research in this
fascinating area, and the Director of this series would feel gratified if this
results in inducing or seducing some of the attendees into this area of work
or in fertilising their own fields of expertise.
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ANNEX 


CALCULATION OF SATCOM LINK CAPACITY UNDER JAMMING

The total uplink data rate that can be supported by a transmitting
SATCOM terminal in the presence of uplink jamming, while maintaining a
minimum acceptable uplink E_b/N_0, is given by,
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V  <VNo>u  w.H)  PJU 
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•  SATCOM  terminal  BIRP 

•  uplink  jammer  E1RP 

■  satellite  receive  antenna 
direction 


MB, 


'SU 


gain 


in 


tha 


SATCOM 


terminal 


M  ■  aatallita  racaiva  antenna  nulling  in  the  jammer  direction 
T_  •  effective  noiae  temperature  of  satellite  racaivar 
k  -  Boltaamna  constant 
B„0  *  uplink  spreading  (hopping)  bandwidth 
uplink  free  apace  loaa 

u  •  minimum  acceptable  energy  par  bit-to-noiae  density  ratio 
after  dehopping  at  tha  aatallita 
Ny  ■  margin  for  atmospheric  and  rain  losses  at  uplink  frequency 
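The quantities listed above are those of a conventional link-budget calculation. The sketch below is an illustration only and is not a transcription of equation (1). It assumes that the jammer acts as broadband noise spread uniformly over the hopping bandwidth, that nulling is a reduction of the satellite receive gain in the jammer direction, that the propagation margin applies to the wanted signal alone, and that terminal and jammer see the same free space loss. The function and parameter names are hypothetical.

```python
BOLTZMANN = 1.380649e-23  # Boltzmann's constant in J/K


def db_to_ratio(value_db):
    """Convert a decibel value to a linear power ratio."""
    return 10.0 ** (value_db / 10.0)


def uplink_data_rate(terminal_eirp_dbw, jammer_eirp_dbw, rx_gain_db,
                     nulling_db, noise_temp_k, hop_bandwidth_hz,
                     free_space_loss_db, min_ebno_db, margin_db):
    """Supportable uplink data rate in bit/s under the assumptions above."""
    # Wanted signal power at the satellite receiver, in watts.
    signal_w = db_to_ratio(terminal_eirp_dbw + rx_gain_db
                           - free_space_loss_db - margin_db)
    # Jammer power at the receiver: same path loss, gain reduced by nulling.
    jammer_w = db_to_ratio(jammer_eirp_dbw + rx_gain_db
                           - nulling_db - free_space_loss_db)
    # Thermal noise density plus jammer power spread over the hop bandwidth.
    noise_density = BOLTZMANN * noise_temp_k + jammer_w / hop_bandwidth_hz
    # Rate at which the received Eb/N0 just equals the required minimum.
    return signal_w / (db_to_ratio(min_ebno_db) * noise_density)
```

Under these assumptions, when the jammer term dominates the thermal noise, each additional 10 dB of nulling (or a tenfold wider hopping bandwidth) raises the supportable data rate tenfold.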


In equation (1) the satellite range from the terminal and from the jammer
(and hence the uplink free space losses) have been assumed to be equal.

Fig. 1 Encoding of communications in an engineering development

Note 1 - The ISDN local functional capabilities correspond to functions
provided by a local exchange and possibly other equipments
such as electronic cross connect equipments, muldexes, etc.

Note 2 - User-to-user signalling needs further study.

Note 3 - These functions may either be implemented within ISDN or be
provided by separate networks.

Note 4 - In certain national situations, ALLF may also be implemented
outside the ISDN, in special nodes or in certain categories of
terminals.

Note 5 - Circuit switching and non-switched functional capabilities
at rates less than 64 kbit/s are for further study.

Note 6 - For signalling between international ISDNs, CCITT No. 7
shall be used.


Fig. 3 Basic architectural model of an ISDN

The Speech Signal

Abstract

This paper provides a non-mathematical introduction to the speech signal. The production
of speech is first described, including a survey of the categories into which speech
sounds are grouped. This is followed by an account of some properties of human perception
of sounds in general and of speech in particular. Speech is then compared with
other signals. It is argued that it is more complex than artificial message-bearing signals,
and that unlike such signals speech contains no easily identified context-independent
units that can be used in bottom-up decoding. Words and phonemes are
examined, and phonemes are shown to have no simple manifestation in the acoustic
signal. Speech communication is presented as an interactive process, in which the
listener actively reconstructs the message from a combination of acoustic cues and
prior knowledge, and the speaker takes the listener's capacities into account in deciding
how much acoustic information to provide. The final section compares speech and
text, arguing that our cultural emphasis on written communication causes us to project
properties of text onto speech and that there are large differences between the styles of
language appropriate for the two modes of communication. These differences are often
ignored, with unfortunate results.


1.  Introduction 

This contribution deals with the nature of the speech signal, the signal that allows one human being to communicate to another
whatever message he or she consciously chooses to express, with no external aids and usually with very little effort. One of its
principal aims is to argue that speech is an exceedingly special kind of signal.

A newly invented or newly discovered signal can be approached objectively. But we can all speak, and the internal
impression we have of speech can cloud our view. To compound the problem, most of us can read, and the impression that we
gain of language from printed text often distorts our ideas of spoken language. The extent of these problems will be discussed
in Sections 4 and 5; but before that some more basic information on the production and perception of speech needs to be
presented. Properties of both production and perception are exploited in almost all systems for recognition, synthesis and
efficient transmission of speech.

2. The Production of Speech

It may not be obvious why the recognition, artificial generation, and efficient transmission of the speech signal should be
helped by an understanding of how humans produce it. We do not, after all, need to know how a teleprinter signal was generated
in order to transmit it, decode it or reproduce it. Arguments for looking closely at human speech production will emerge
towards the end of this section and in later sections. For the moment, we can at least note that production mechanisms provide
a useful framework for describing the speech signal.

The following brief account of speech production is simplified in two ways. First, it excludes certain production mechanisms
not generally found in major European languages, and second it presents a classical view of distinctions occurring in carefully
produced speech. Later sections will present examples where real speech differs from the simple description.

The organs primarily involved in producing speech are the larynx, which contains the vocal cords, and the vocal tract,
which is a tube leading from the larynx along the pharynx and then branching into the oral cavity leading to the lips and
through the nasal cavity to the nostrils. The nasal side branch can be closed off by raising a valve at the back of the mouth
called the uvula.

Acoustic energy in speech can be generated in two different ways. The primary mechanism, known as voiced excitation,
occurs in the larynx. The vocal cords open and close quasi-periodically at an average rate of about 110 times a second for a
man and about twice that for a woman. The main instant of voiced excitation occurs not, as one might expect, on opening, but
when the airflow from the lungs is suddenly stopped as the cords are pulled together by Bernoulli forces. The resulting voiced
speech sounds include all vowels (unless whispered) and many consonant sounds: the words Roman, yellow, and wiring, for
example, are composed entirely of voiced sounds.

The second mechanism for generating acoustic energy in speech uses turbulence resulting from a constriction created by
the tongue or lips. Sounds generated purely in this way (such as the "s" and "ft" in soft) are said to be voiceless, and they
generally play a less important role in speech than voiced sounds.

The two excitation mechanisms just described can occur simultaneously, as they do in the initial sounds of zip and vat. In
English, at least, sounds with both kinds of excitation constitute the smallest of the three classes.

As we have already seen, vowel sounds are voiced. They are produced without any obstruction in the oral cavity. If the
branch to the nasal tract is open, the vowel is said to be nasalised (such as the vowels in the French words bon, tutu, faim,
etc.). Vowels can be further divided into so-called pure vowels, which can be produced in isolation with a stationary vocal
tract, and diphthongs (such as in the words toy, tow and sigh), where a movement of the articulators (the tongue, lips or jaw) is
necessary.

Consonants, on the other hand, always involve a narrowing in the oral tract. At one extreme, the narrowing may result in
total obstruction. Sounds involving such total obstruction come under the general heading of stops, though the term encompasses
two distinctly different sets of sounds. If the nasal branch is open, voiced excitation produces nasal consonants such as
the final sounds in Jim, sin and sing. If the nasal branch is closed, no air can flow from the lungs. Pressure builds up, and
when the oral closure is released, the resulting turbulent airflow produces a plosive consonant. Examples of voiceless plosives
occur at the beginnings of the words pin, tin and kin, while examples of corresponding voiced stops occur in bun, done and
gun. When voiceless plosives are followed by a vowel or other voiced sound, voicing must begin at some instant after the
release of the closure. In most dialects of English, the turbulence caused by the narrow opening just after release of the closure
is followed by a period of about 100 ms of airflow through the larynx with light turbulence. This is known as aspiration, and
resembles the initial sounds in words like hop, hip and hat. On the other hand, in most dialects of French vocal cord activity
in voiceless stops begins at an instant close to the release, without an intervening period of aspiration. In voiced plosives,
vocal cord activity can begin at the instant of release or during the closure as pressure in the oral cavity is built up. Onset of
voicing in voiced plosives again tends to occur earlier in French than in English. Although there is no airflow through the lips,
some low frequency sound escapes through the walls of the vocal tract when there is voicing during closure.

As we have seen, airflow through a constriction causes turbulence. When this process is steady, the resulting sound is
known as a fricative, either voiceless (as in the initial sounds of fat, sip and thick) or voiced (as in the corresponding sounds in
vat, zip and the).

When the vocal tract is narrowed, but not enough to cause turbulence, a class of consonant sounds such as the initial
sounds in way, ray and lay is produced. They are lumped together under the general heading of sonorants.

This survey of speech sounds is incomplete even for English, but it covers the main categories. We can now go on to
look briefly at the acoustics of speech production.

Whether the excitation in a speech sound is voiced or voiceless, the acoustic signal generated by the excitation is modified
by the resonant structure of the vocal tract, which behaves as an acoustic tube along which planar propagation of sound waves
occurs. Differences in the cross-sectional area along the length of the tube cause reflections, and it is these reflections that give
rise to the resonances or formants. The resonant structure therefore depends on the position that the tongue, lips and jaw are
in.

The generation of the excitation and its spectral modification by the vocal tract turn out to be largely independent of each
other. To a good approximation, they can therefore be considered as a source isolated from, and leading into, a linear filter.
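The source-filter view can be sketched in a few lines of code. The example below is a minimal illustration, not a description of any particular synthesiser: it drives a cascade of two-pole resonators with a periodic impulse train. The 110 Hz fundamental and the 500, 1500 and 2500 Hz formants are taken from the figures quoted in this section; the 8 kHz sampling rate, the formant bandwidths and all function names are assumptions made for the example.

```python
import math


def impulse_train(f0_hz, fs_hz, n_samples):
    """Idealised voiced excitation: one unit impulse per pitch period."""
    period = round(fs_hz / f0_hz)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]


def resonator(x, centre_hz, bandwidth_hz, fs_hz):
    """Two-pole linear filter modelling a single formant."""
    radius = math.exp(-math.pi * bandwidth_hz / fs_hz)
    a1 = 2.0 * radius * math.cos(2.0 * math.pi * centre_hz / fs_hz)
    a2 = -radius * radius
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2
        y.append(out)
        y2, y1 = y1, out
    return y


def synthesise_vowel(f0_hz=110.0,
                     formants=((500.0, 60.0), (1500.0, 90.0), (2500.0, 120.0)),
                     fs_hz=8000, n_samples=800):
    """Source (impulse train) feeding a filter (resonators in series)."""
    signal = impulse_train(f0_hz, fs_hz, n_samples)
    for centre_hz, bandwidth_hz in formants:
        signal = resonator(signal, centre_hz, bandwidth_hz, fs_hz)
    return signal
```

Because the source and the filter are separate, the fundamental frequency can be changed without touching the formants, and vice versa.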

The upper trace of Figure 1 shows a 20 ms stretch of the waveform of a non-nasalised vowel (strictly, it is the
time-differenced waveform: differentiation provides a 6 dB per octave lift, which serves to flatten the long-term spectrum for voiced
speech). Notice that the waveform consists of a pattern that repeats itself at regular intervals. The repetition rate is the rate at
which the vocal cords come together (the fundamental frequency of this speech sound), while the repeating pattern itself is
the response of the vocal tract to this periodic excitation.

The lower trace in Figure 1 shows the excitation with the effect of the vocal tract removed. The impulse-like excitation
occurs each time the vocal cords come together and close off the airflow from the lungs. In the particularly simple vowel
shown here (the "neutral" vowel occurring in a word such as the standard British English pronunciation of bird) the impulse
travels from the larynx to the lips, where part of it is radiated into the open air beyond and part is reflected back towards the
larynx with its polarity reversed. At the larynx the signal is reflected again, this time without polarity reversal, and it continues
to bounce between larynx and lips, steadily losing energy by absorption in the walls of the vocal tract, by absorption below the
vocal cords, and by radiation to the outside world, until the next excitation impulse comes along. The pattern of an impulse
emerging with alternating polarity can be seen in the upper trace of Figure 1. The impulse gets rounder as time progresses
because high frequency components are lost faster than low frequency components.

Figure 2 shows the power spectrum of a section of speech waveform like the one in Figure 1. The regularly spaced spikes
occur at each integer multiple of the fundamental frequency of the excitation, and are harmonics of the fundamental. The intensity
of the harmonics is determined by the product of two factors. The first is the spectrum resulting from the details of the
airflow through the larynx from one closure of the vocal cords to the next, and the second is the spectrum corresponding to the
impulse response of the vocal tract.

Let us look at the laryngeal component of the spectrum first. This component is generally smooth, and above a few hundred
Hertz it declines at about 12 dB per octave. To some extent, however, this decline is counterbalanced by a 6 dB per
octave rise due to the effect of radiation from the mouth, giving a net decline of the excitation spectrum of voiced speech of
around 6 dB per octave. An impulse has a flat power spectrum, so in order to make the excitation signal impulse-like its spectrum
must be made roughly flat. This is why Figures 1 and 2 used differentiated speech. The impulse response of the vocal
tract then appears directly in the upper trace of Figure 1.
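The 6 dB per octave lift attributed to differencing can be checked numerically. The sketch below assumes the simplest digital differentiator, a first difference y[n] = x[n] - x[n-1], whose gain at angular frequency w is 2|sin(w/2)|. The 8 kHz sampling rate used in the usage note and the function names are assumptions made for the example.

```python
import math


def first_difference_gain_db(freq_hz, fs_hz):
    """Gain in dB of y[n] = x[n] - x[n-1] at the given frequency."""
    omega = 2.0 * math.pi * freq_hz / fs_hz
    return 20.0 * math.log10(2.0 * abs(math.sin(omega / 2.0)))


def lift_per_octave_db(freq_hz, fs_hz):
    """Extra gain obtained when the frequency is doubled from freq_hz."""
    return (first_difference_gain_db(2.0 * freq_hz, fs_hz)
            - first_difference_gain_db(freq_hz, fs_hz))
```

With a sampling rate of 8 kHz, lift_per_octave_db(100.0, 8000.0) is very close to 6 dB, which cancels the net 6 dB per octave decline described above. The lift falls somewhat below 6 dB per octave as the frequency approaches half the sampling rate.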

The exact shape of the voiced excitation spectrum varies from individual to individual and changes with the intensity of
the speech and the mood of the speaker. In most languages, however, such changes are not used to carry information about the
explicit content of a spoken utterance. Incidentally, since women have "higher pitched" voices than men, they are often
assumed to have more intense high frequency components in their voices. If anything, the reverse is true: women generally
have higher fundamental frequencies, so the harmonics are further apart, but the excitation spectrum tends to fall off more
rapidly in women's voices than in men's.


Figure 2. The power spectrum of a time-differenced neutral vowel.

The intensities of the harmonics in Figure 2 show a series of smooth peaks. This structure is due to the impulse response
of the vocal tract, and the peaks correspond to formants. Formants are numbered in order of increasing frequency. In the
particularly simple sound illustrated in Figure 1 the vocal tract resembles a tube of uniform cross-sectional area from the vocal
cords to the lips. For a typical male vocal tract, this gives rise to a first formant at about 500 Hz and to subsequent formants
spaced 1 kHz apart at 1.5 kHz, 2.5 kHz, etc. Since the vocal tract of a woman is typically 10 to 13% shorter, the corresponding
formant frequencies are raised by this amount.
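The formant pattern of the uniform tube follows from the standard quarter-wavelength result for a tube closed at one end (the larynx) and open at the other (the lips): the n-th resonance lies at (2n - 1)c/4L. The sketch below is illustrative only. The tube length of 17.5 cm and the speed of sound of 350 m/s are assumed values chosen to reproduce the 500 Hz first formant quoted above, and the function name is hypothetical.

```python
def tube_formants(length_m, n_formants=3, speed_of_sound=350.0):
    """Resonances (Hz) of a uniform tube closed at one end, open at the other."""
    return [(2 * n - 1) * speed_of_sound / (4.0 * length_m)
            for n in range(1, n_formants + 1)]


# A 17.5 cm tube, and the same tube shortened by 10%.
male = tube_formants(0.175)
shorter = tube_formants(0.175 * 0.9)
```

Because the resonances scale as 1/L, shortening the tube by 10% raises every formant by the same factor of 1/0.9, that is by about 11%.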

Figure 3 shows a spectrogram of the sequence of words "delta nine one nine." The horizontal axis corresponds to time
and the vertical axis to frequency from 0 to 3 kHz. Regions of high energy appear dark. Since the analysis used here has a
lower frequency resolution than that in Figure 2, harmonics of the fundamental are not resolved. The vertical striations
correspond to the excitations caused by the vocal cords, while the broad horizontal or sloping bars are formants.

Figure 4 shows the second formant being excited as the airflow through the vocal cords is stopped. The loss in energy
after excitation is roughly exponential, though the rate of energy loss is greater when the vocal cords are open than when they
are closed, since energy is absorbed into the trachea and lungs during the open phase. The increased damping during this
phase also causes a slight decrease in the frequencies of the formants. As we have seen, high frequency energy is lost faster
than low frequency energy. Consequently, the higher formants have larger bandwidths than the lower ones.

The excitation in voiceless sounds resulting from turbulence in the vocal tract resembles white noise. As with voiced
sounds, however, radiation effects from the lips tend to reduce the intensity of the low frequency components, and in voiceless
sounds energy in the first few hundred Hertz is consequently weak. In voiceless sounds, formant structure is much less marked
or even, particularly for "f" sounds, non-existent. The first formant is not normally excited.

The description of voiced speech in terms of an impulse response and the frequency of the impulses has several advantages.
The impulse response varies as the positions of the tongue, jaw and lips are changed, while the fundamental frequency
depends on the muscles that control the tension in the vocal cords and on the air pressure behind the vocal cords. For the most
part, changes in the settings of the larynx and vocal tract occur slowly relative to the perceptually important frequencies in the
speech waveform, which are determined by the time between successive reflections of sound waves in the vocal tract. Thus,
while we need to sample the speech waveform at least eight thousand times a second to obtain a reasonable digital representation,
a description in terms of fundamental frequency and a few parameters describing the impulse response typically needs to
be updated as little as a hundred or even fifty times a second, and even then the changes between updates tend to be small.

A second major advantage of an impulse-response/fundamental-frequency description is that the two factors perform
separate linguistic functions. In most western languages the identity of a word does not depend on the fundamental frequency
pattern with which it is uttered. In some other languages, such as Chinese, the identity of a word may depend on the fundamental
frequency pattern, but even then an analysis strategy must still separate the two factors: the fundamental frequency pattern
and the configuration of the articulators in the vocal tract remain substantially independent attributes of the word.

For non-nasalised vowels and some non-nasal consonants the impulse response of the vocal tract is quite accurately
modelled by a set of resonances in series; that is, the vocal tract can be regarded as an all-pole filter, and its effect can be
completely specified by the frequencies and bandwidths of the poles, corresponding to formants. For such sounds, a technique
known as linear predictive coding (LPC) can in principle be used to determine from the waveform the frequencies and
bandwidths of the resonances (see the book by Markel and Gray [1]).
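How LPC can recover a resonance from a waveform can be illustrated with the autocorrelation method and the Levinson-Durbin recursion, one common way of solving the linear prediction equations. The sketch below is a minimal example under stated assumptions, not the procedure of any particular coder: it handles a single resonance (prediction order 2), and all function names and the 8 kHz sampling rate in the usage note are chosen for the illustration.

```python
import math


def all_pole_impulse_response(a, n_samples):
    """Impulse response of x[n] = e[n] + sum over k of a[k-1] * x[n-k]."""
    x = []
    for n in range(n_samples):
        value = 1.0 if n == 0 else 0.0
        for k, coeff in enumerate(a, start=1):
            if n - k >= 0:
                value += coeff * x[n - k]
        x.append(value)
    return x


def autocorrelation(x, max_lag):
    """Autocorrelation of x for lags 0..max_lag (zero outside the frame)."""
    return [sum(x[n] * x[n - lag] for n in range(lag, len(x)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Predictor coefficients a[1..order] and the residual prediction error."""
    a = [0.0] * (order + 1)  # a[0] is unused
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        reflection = acc / error
        updated = a[:]
        updated[i] = reflection
        for k in range(1, i):
            updated[k] = a[k] - reflection * a[i - k]
        a = updated
        error *= 1.0 - reflection * reflection
    return a[1:], error


def resonance_from_pole_pair(a1, a2, fs_hz):
    """Centre frequency and bandwidth (Hz) of the complex pole pair of
    1 - a1*z^-1 - a2*z^-2; assumes a2 < 0 and complex poles."""
    radius = math.sqrt(-a2)
    angle = math.acos(a1 / (2.0 * radius))
    return angle * fs_hz / (2.0 * math.pi), -math.log(radius) * fs_hz / math.pi
```

Applied to the impulse response of a single resonance at 500 Hz with a 60 Hz bandwidth, sampled at 8 kHz, an order-2 analysis returns that frequency and bandwidth almost exactly. Real speech needs a higher order (roughly two poles per expected formant) and a root-finding step to convert the predictor polynomial into pole positions.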

In other sounds, notably in nasal consonants and nasalised vowels, the all-pole model of the vocal tract is not valid.
Resonances are configured in parallel as well as in series, and consequently zeros as well as poles appear in the transfer function of
the vocal tract filter.


Figure 3. Spectrogram of the word sequence "delta nine one nine" produced by a male speaker.

Figure 4. 50 ms of voiced speech from a male speaker. The bottom trace, labelled Lx, is a measure of the
electrical impedance across the larynx and correlates strongly with the area of contact of the vocal cords, with
increasing contact being in the downward direction on the trace. The trace above it, labelled Fa, shows the
airflow through the vocal cords. The next trace up, labelled F2, shows the waveform corresponding to the second
formant only. Finally, the top trace, labelled Fa'', shows the impulse-like waveform produced by differentiating Fa
twice. This corresponds to the lower trace in Figure 1.


3. The Perception of Speech and Other Sounds

Human hearing is similar to that of closely related animals, who, of course, do not use speech. It does not therefore appear to
have adapted to the properties of the speech signal. Rather, speech must have evolved to suit the properties of our sense of
hearing. In artificially generating speech or in trying to transmit it efficiently, there is clearly no point in striving to reproduce
features that are inaudible. Equally, in speech recognition it would be misguided to depend on features that are inaudible to
humans, since a speaker is unlikely to control features that he or she cannot hear, unless they are locked to other, audible,
features, in which case they carry no additional information. It is therefore important to take account of our, admittedly limited,
knowledge of human hearing.

Our impression of the loudness of a sound follows more closely the log of the acoustic energy rather than its linear value.
Thus, successively doubling the energy in a sound gives an impression of equal steps in loudness, and the loudness of a sound
is normally measured on the logarithmic decibel scale.
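The equal-steps property is a direct consequence of the logarithm, as the short sketch below shows. The function name and the choice of reference energy are assumptions made for the example.

```python
import math


def energy_to_db(energy, reference=1.0):
    """Level in decibels of an energy relative to a reference energy."""
    return 10.0 * math.log10(energy / reference)


# Successive doublings of energy give equal steps on the decibel scale.
levels = [energy_to_db(e) for e in (1.0, 2.0, 4.0, 8.0)]
steps = [b - a for a, b in zip(levels, levels[1:])]
```

Each entry in steps is 10 log10(2), about 3 dB, regardless of the starting energy.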

The amplitude sensitivity of our hearing peaks in the 1 to 3 kHz range. It falls off markedly somewhere below 100 Hz
and, depending on our age, somewhere above 9 to 10 kHz.

The frequency sensitivity of the ear can be measured in various ways: by having listeners determine subjectively equal
frequency intervals at different locations in the spectrum; by testing their ability to detect small changes in frequency; by
measuring the frequency range over which spectral components interact; or even by direct physiological measurements on the
inner ear. All these methods lead to strikingly similar perceptual frequency scales, with sensitivity being roughly constant over
the first few hundred Hertz and then decreasing with increasing frequency. The perceptual frequency scale is often approximated
by a scale, the technical mel scale, that is linear to 1 kHz and logarithmic from there on.
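A scale that is linear to 1 kHz and logarithmic above it can be written as a simple piecewise function. The sketch below is one possible construction, chosen so that the two pieces join smoothly at the corner. It is an assumption made for illustration and is not claimed to be the exact definition used in any published mel scale; the function name is hypothetical.

```python
import math


def technical_mel(freq_hz, corner_hz=1000.0):
    """Perceptual frequency: linear up to corner_hz, logarithmic above it."""
    if freq_hz <= corner_hz:
        return freq_hz
    # Value and slope both match the linear branch at the corner frequency.
    return corner_hz * (1.0 + math.log(freq_hz / corner_hz))
```

On this scale the octave from 1 to 2 kHz and the octave from 2 to 4 kHz occupy the same perceptual distance, whereas below 1 kHz equal distances correspond to equal differences in Hertz.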

Just as one might expect from signal processing considerations, the degradation in frequency resolution at higher frequencies
is associated with an improvement in temporal resolution. This tradeoff is well matched to the acoustic properties of
speech. As we saw in Section 2, the higher formants have large bandwidths and do not therefore require high frequency resolution.
In voiceless sounds, energy tends to be concentrated at high frequencies. Spectral fine structure is absent, but such
sounds, particularly voiceless plosives, contain events that are sharply defined in time. Voiced sounds therefore require good
frequency resolution at low frequencies (below 2 kHz) and voiceless sounds require good temporal resolution above 2 kHz.

Unless two frequency components are within a certain critical distance of each other on a perceptual frequency scale, their
phase relationship has no perceptual effect. Consequently, a sound can be substantially characterised by its power spectrum,
ignoring its phase spectrum.

Strong frequency components can suppress the ear's response to weaker components. In temporal masking, the strong
component masks a weaker component at the same or a nearby frequency. The stronger component can occur just before or
just after the weaker component, though the effect operates over much greater temporal separations in the former case
(so-called forward masking) than in the latter. In simultaneous masking or frequency masking a strong component masks the
presence of a weaker component presented at the same time at a different frequency. The effect decreases as the frequency
separation between the components increases, but the decrease is slower when the weaker component lies above rather than
below the stronger component. Frequency masking therefore operates primarily upwards in frequency.

The use of our two ears allows us to deduce the direction of a sound source in the horizontal plane, since there will generally
be a difference between the time of arrival of an acoustic event at the ears that depends on the direction from which it is
coming. In addition, the shape of the external ear appears to have a direction-dependent filtering effect on sounds that allows
some directional sensitivity in the vertical plane and discrimination between sound sources behind and in front of the listener.
These capacities certainly contribute to our ability to follow a particular conversation in a crowded room, though this ability
also seems to exploit a more sophisticated mechanism that allows us to track a particular voice.
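The dependence of arrival-time difference on direction can be illustrated with a very simple geometric model. The sketch below treats the ears as two points in free space and the source as distant, so that the extra path to the far ear is the ear spacing times the sine of the azimuth. It ignores diffraction around the head. The ear spacing of 0.18 m, the speed of sound of 343 m/s and the function name are assumptions made for the example.

```python
import math


def interaural_time_difference(azimuth_deg, ear_spacing_m=0.18,
                               speed_of_sound=343.0):
    """Arrival-time difference in seconds for a distant source.

    azimuth_deg is measured from straight ahead; positive values denote
    a source to one side, negative values a source to the other side.
    """
    extra_path_m = ear_spacing_m * math.sin(math.radians(azimuth_deg))
    return extra_path_m / speed_of_sound
```

In this model a source straight ahead gives no time difference, and a source directly to one side gives the largest difference, roughly half a millisecond. A source at a given angle in front of the listener and one at the mirror-image angle behind produce the same time difference, which is why a further cue such as the filtering effect of the external ear is needed to tell them apart.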

So far, we have been considering the perception of sounds in general. Let us now turn to consider speech sounds in
particular.

Klatt [2] showed that listeners use different criteria when judging the phonetic similarity of two speech sounds from those
they use when simply judging the acoustic similarity of two sounds. For example, changes in the spectral balance of the signal
such as are caused by manipulating the tone controls on a stereo have little effect on phonetic judgments. This makes us
immune to the spectral tilt effects of the telephone, of room acoustics and of shouting.

Phonetic judgments in voiced speech turn out to depend strongly on the frequencies of the first three formants, though not
on their bandwidths, nor on the details of higher formants. This sensitivity to the lower frequency and most intense peaks in
the spectrum can perhaps at least partly be explained by simultaneous masking, which would tend to mask the weaker higher
formants and spectral details in the regions between the lower formants.

At this point it might be interesting to look at the extent to which two analysis techniques, LPC and mel-scale filter banks,
that have both been widely used in speech recognition and in speech transmission incorporate the perceptual properties
discussed so far. Both represent the short term power spectrum on a log scale, ignoring the phase spectrum. If LPC is viewed as
a technique for matching the power spectrum, it has the interesting property of not making a least-squares fit to the whole
spectrum as one might expect but rather of concentrating on fitting the strong parts, i.e. the formant peaks, well. On the other
hand, conventional LPC is unable to reflect directly the non-uniform frequency resolution of the ear. A filter bank can simply
reflect perceptual frequency resolution in the width and spacing of its channels. A hybrid analysis technique, perceptual linear
prediction (PLP), is able to combine these two desirable properties and has shown some advantage in speech recognition [3].

When the vocal tract is reopened during a plosive sound the formant frequencies pass through rapid transitions as the
articulators involved in the closure move apart. Our hearing system is particularly sensitive to these formant transitions, and they
constitute strong cues to the identities of plosives.

By manipulating such transitions in synthetically generated speech stimuli, the boundaries between speech sounds,
between "b" and "d" sounds, for example, have been probed. It turns out that consonant sounds are perceived categorically
[4]. That is, sounds are not perceived as partly "b-like" and partly "d-like"; rather, they are perceived as fully either
one or the other. Any such effect in the perception of vowels is much less marked.

We saw in the previous section that a production-oriented approach can lead to an efficient description of the speech signal
because the articulators involved in speech production move slowly relative to the time between successive reflections of
sound waves in the vocal tract. If the motion of the articulators could be derived directly from the speech waveform, it might
provide a particularly good representation for the perception of speech. The Motor Theory of Speech Perception [5] holds that
this is exactly what human listeners do. Although the more extreme expressions of this view are probably less popular now
than they once were, it must surely have some validity. In fluent speech the articulators rarely reach the extreme positions
occurring in speech sounds spoken in isolation; rather they take short-cuts between the positions needed for neighbouring
sounds, and the degree of the short-cuts depends on the carefulness and rate of the speech. It is hard to imagine how a speech
perception mechanism could handle the acoustic variations caused by this behaviour without resorting to a model of speech
production.

Automatic speech recognition, in particular, would undoubtedly be helped enormously by a thorough understanding of
how humans routinely accomplish the task. Sadly, though, we are still far from such an understanding. Some of the points
discussed in the next section may make the magnitude of the problem a little clearer.

4.  Speech as a Communications Signal

Speech is a signal with an intended message. In this respect it differs from, say, EEG signals or from a signal transmitted from
a satellite representing an image of a portion of the earth. Such an image has the obvious difference that it is two dimensional
while the speech signal is effectively one dimensional. The more important difference, though, is that the satellite image is not
a communication: it contains information but it does not contain a message. The very same image might be used to study the
vegetation of an area or to try to spot missile silos, but presumably the image processing techniques appropriate for the one
task would be quite different from those appropriate for the other. Thus, image processing tends to be a loose collection of
techniques with diverse goals.

Apart from certain applications such as speaker recognition, speech processing is concerned with the intended message:
with transmitting it, recognizing it, or generating it. It is, therefore, a narrower, more focussed activity than image processing.

The discussion that follows excludes certain kinds of social communication such as "Hello, how are you?," where the
speaker is not so much enquiring into the state of health of the listener as making a semi-voluntary announcement of his or her
feelings and relationship to the listener. This use of speech is similar to the way in which a dog might bark a greeting at its
master or a threat at an intruder. It is not what makes human speech special, and it is not of primary interest in communicating
with machines or, presumably, in military communications.

Speech communication can be usefully compared with man-made artificial communications signals, such as H.F.
teleprinter transmissions or telephone dialing signals. In such signals, there is quite clearly a message, and the message is laid out
sequentially in time or space just like speech. The similarities to speech are obvious; the differences much less so, but they are
nonetheless large and worth looking at.

The artificial signals in our examples are composed of a sequence of units, the units being selected from a definite, known
set that we could call an alphabet. The units in a message are generally well separated from each other, and they do not
interact (Figure 5). The decoding device usually has available to it in some form an ideal, undistorted representation of the
alphabet, and decoding consists mainly of trying to identify the received units one by one using its built-in knowledge of the
ideal forms.
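
The decoding scheme just described for artificial signals can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four-symbol alphabet, the eight-sample unit length and the Gaussian noise model are all hypothetical and are not taken from any real teleprinter or dialing standard. The sketch relies on exactly the two properties noted above: the units are well separated (fixed, known boundaries) and the decoder holds an ideal form for every unit.

```python
import random

# Hypothetical alphabet: each unit has a known, ideal, undistorted form.
ALPHABET = {
    "A": [1, 1, 1, 1, 0, 0, 0, 0],
    "B": [1, 1, 0, 0, 1, 1, 0, 0],
    "C": [1, 0, 1, 0, 1, 0, 1, 0],
    "D": [0, 0, 0, 0, 1, 1, 1, 1],
}
UNIT_LEN = 8  # every unit occupies the same number of samples


def transmit(message, noise=0.2, seed=0):
    """Lay the units out one after another and add Gaussian channel noise."""
    rng = random.Random(seed)
    signal = []
    for symbol in message:
        signal.extend(x + rng.gauss(0.0, noise) for x in ALPHABET[symbol])
    return signal


def decode(signal):
    """Identify the received units one by one against the ideal forms."""
    decoded = []
    for start in range(0, len(signal), UNIT_LEN):
        unit = signal[start:start + UNIT_LEN]
        # Pick the alphabet entry with the smallest squared distance.
        best = min(
            ALPHABET,
            key=lambda s: sum((u - t) ** 2 for u, t in zip(unit, ALPHABET[s])),
        )
        decoded.append(best)
    return "".join(decoded)


if __name__ == "__main__":
    print(decode(transmit("ABCD", noise=0.0)))
```

Neither assumption built into `decode` (known unit boundaries, known ideal forms) holds for speech, which is the point developed in the paragraphs that follow.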

What is the equivalent of these units for the speech signal? There seems to be no single exact equivalent. Perhaps the
closest candidate is the word, but words differ in several major respects from our artificial units.

First of all, notwithstanding our prejudices from the written form of language, spoken words do not in general have
gaps between them (see Figure 3, where the only gap occurs before the "t" in "delta"). Indeed, there are no consistent acoustic
cues of any kind to word boundaries. What is more, not only are words not well separated from each other, they often
interact at their boundaries. For instance, "bread board" is often pronounced in fluent English in a way that we might write as
"breab board," and "this shop" as "thish shop." Indeed, in fluent speech, short, low-content words such as the, of and a are
so strongly influenced by their context that they are often unrecognisable when excised from it.

Next, we know of no ideal reference forms of words; any normally pronounced version of a word is as good as any other,
and no two productions will ever be exactly the same. In particular, words differ in their prosodic features (intonation, timing
and loudness) depending on their function in a sentence. Even in such a prosaic utterance as a list of digits, the final digit
differs markedly from the others, being typically 60% longer and having a falling intonation (see Figure 3). When people try to
generate synthetic sentences by recording words in isolation and playing them back unmodified in a sequence, the result is
disastrous: each word is perfectly clear, but it is almost impossible to grasp the meaning of the sentence.

When words were suggested above as the best equivalent of artificial communication units, some readers may have been
surprised that phonemes were not proposed. Such surprise would be understandable considering the number of popular articles
on speech technology that talk about speech being made up of phonemes as though it were like laying out bricks in a line,
just like the symbols in teleprinter transmissions. Proponents of phonemes might also point out that the phoneme inventory
(just over forty in English) is much more manageable, more alphabet-sized, than the enormous inventory of words in a
language. Some people might also be influenced by the way words are printed as a string of discrete context-independent
letters. Despite all this, phonemes bear little resemblance to teleprinter symbols. If we must have a writing analogue for
phoneme sequences, quite a good one is provided by hastily scribbled handwriting, in which individual letters are hard to
isolate and depend for their form on the other letters around them.
A phoneme is defined as the smallest unit of speech within a word that when changed results in a change in the meaning
of the word. Thus, the English word tap differs from the English word cap in the position of the tongue at the start of the two
words. In tap the point of contact between the tongue and the roof of the mouth is just behind the upper teeth, while in cap it
is towards the back of the mouth. We can conclude that cap and tap must start with a different phoneme. We could have
started with the tongue making contact in other places: it could have been directly behind the upper teeth like the "t" sound in
eighth, or the tip of the tongue could have been curled back slightly like the "t" in tree. If we used either of these "t"
sounds in our word tap we would not get a new word; we would simply have tap with a slightly non-standard pronunciation, and
we might not even notice that the word sounded odd if it occurred in fluent speech. Yet these same "t" sounds represent
different phonemes for some other languages. For speakers of such languages (which include several major languages spoken
in India) the "t" variants presumably sound quite distinct. In the same way, the English "l" and "r" sounds in words like
lap and rap, which sound quite different to English speakers, do not correspond to different phonemes in Japanese, so Japanese
speakers have difficulty in making the distinction.

Thus, cues that provide phonemic distinctions in a language are much more noticeable than those that do not. English is
often considered not to have nasalised vowels, but in fact they are as common in English as in French; it is simply that
nasalisation is not phonemic in English: that is, nasalisation cannot change the meaning of a word, and its presence is optional.
While the vowel in French canne is never nasalised, that in English can almost always is, though we probably would not
notice if it were not. Failing to nasalise a French nasal vowel is very noticeable: it produces either nonsense or a different
word; for instance, bâton (stick) would turn into bateau (boat) in standard French if the nasalisation were removed.

Phonemes, then, are not "speech sounds" in some absolute sense; they are a property of the way a language gets coded in
sound, and their phonetic realisation is frequently context dependent. Something interesting is happening in standard French
right now: the vowel sounds in the words jet (jet) and gel (frost) used to be different phonemes, that is to say, there existed
pairs of words such as pré (meadow) and près (near) that differed just by the fact that the first had the jet vowel in it and the
second the gel vowel. French speakers are increasingly using a new rule that says that the jet vowel can occur only at the end
of a word and the gel vowel only when followed by a consonant sound. Thus the pré/près distinction is lost, and the two
vowels have become context-dependent allophones of the same phoneme. French has lost a phoneme, but it has not lost a
speech sound.

So far, we have established that a phoneme does not correspond to a single speech sound, but perhaps we could say that it
corresponds to a set of sounds. If by "sounds" we mean something we can hear and identify in isolation, the answer has to be
no, or at least not always. The English word dell (or the first syllable of delta in Figure 3) is made up of three phonemes /d/
and /ɛ/ and /l/ (phonemes are conventionally written between oblique lines); but as Figure 3 shows the syllable consists of a
continuous acoustic sequence, and there is no way of pronouncing the /d/ without also pronouncing the vowel after it. If we
take a recording of dell and listen to what happens as we successively chop off more and more of the end, we never get to hear
a /d/ in isolation: when we have shortened it enough that we no longer hear the vowel, we no longer hear anything that we
perceive as speech.

Vowels, of course, can be produced and perceived in isolation. But in the dell example just described, when the word has
been shortened to the point where the "l" sound is no longer heard, the vowel is not perceived as the "e" in dell but as the
reduced (schwa) sound heard in an unstressed the.

The picture of what a phoneme might be in acoustic terms gets even fuzzier when we start to ask about the acoustic
features a listener might use to decide what phoneme sequence he or she is hearing. By using a speech synthesizer, researchers
have been able to vary the properties of speechlike sounds and so investigate the phonetic cues that listeners use. It turns out
that they often do not depend on a single cue but rather weigh the evidence from several independent features. Some results
have been particularly surprising. For example, the words ones and once are normally felt to differ just in their last phoneme,
ones ending in the voiced phoneme /z/ and once in the corresponding voiceless phoneme /s/; but it is possible to change a
listener's judgment of which word is being presented merely by altering the length of the /n/ sound (a longer /n/ causing ones
to be heard). Indeed, this is probably the most important phonetic cue in discriminating between these words in natural speech.
Here we have an example, then, where the major distinguishing mark of a phoneme is not only not what we would expect it to
be, it is not even where we would expect to find it.

Moreover, cues to phoneme identity are not even entirely confined to the auditory channel: in appropriate circumstances
visual cues can be integrated into speech perception. The point has been convincingly demonstrated [6] by synchronizing a
recording of a plosive-vowel sequence (e.g. "ba") with a video recording of a person producing a different stop consonant
followed by the same vowel (e.g. "ga"). The perception of the sound is strongly modified by the conflicting visual cues: in
the ba/ga example what is perceived is "da." The effect has perhaps to be seen to be fully believed: when I saw a
demonstration I "heard" a perfectly natural "da" as long as I watched the screen; as soon as I looked away it reverted to "ba."

Speech, then, clearly cannot be considered as a simple sequence of speech sounds, nor even as a sequence of discrete
words. At the acoustic level, there are no known discrete units whose ideal forms can be defined. This is a further reason why,
in contrast to artificial signals, it is useful to study the production of the speech signal in order to describe its acoustic
properties.

The fact remains, however, that we do have a strong internal impression of speech as being made up of neat sequences of
words and words as being made up of neat sequences of discrete, context-independent speech sounds.

Visual perception may provide a clue to what is going on in speech perception. Information on a scene reaches us as a
two-dimensional pattern of light on our retinas, yet we perceive a world of three-dimensional objects. This process is not
strictly dependent on stereo imaging or on the lens adjustment needed for focusing, because we have no difficulty in
interpreting scenes on a television or cinema screen, where such information is absent. Our visual perception is not a passive reception
of a pattern of light but rather an active reconstruction of a scene based on the visual evidence and on our knowledge of the
world. People who have been blind from birth and who gain visual function as adults are said to have great difficulty in
learning to see; even though the information transmitted by their optic nerves may be the same as that of other sighted people, they
have simply not learned to interpret that information. In normal individuals this interpretation is unconscious and cannot be
turned off. When we look at a drawing or a painting of a scene we automatically interpret it in three dimensions. If the picture
contains paradoxes that prevent a consistent three-dimensional interpretation, such as, for example, in many of the works of the
artist M.C. Escher, we cannot choose to avoid the paradox by perceiving the picture as a meaningless pattern of light and dark
on a flat piece of paper. Instead, we are compelled to go on trying to "make sense" of it as a three-dimensional scene.

Just as our visual system does not function like a camera passively recording incident light, so our perception of speech
cannot be likened to the action of a microphone passively transcribing acoustic signals. Rather, we actively reconstruct the
message from the various phonetic and prosodic cues in the signal together with our knowledge of the vocabulary and syntax of the
language, of what would be meaningful and germane to the situation, and of the known habits of the speaker.

This reconstruction process is so effective and automatic that we are not normally aware that it is going on. On the
telephone, for example, we rarely notice that the final "s," "th" and "f" sounds of toss, lath and laugh are virtually
indistinguishable; it is only when we have to note down an unfamiliar name that we become aware of just how much acoustic
information is missing.

Even when the acoustic signal is undegraded, our perception of speech sounds can be switched by information from other
parts of the sentence. Thus, when we are primed with

the dogs chased the cats,

we tend to hear the completion of the sentence as

and the cats shinned up the tree,

whereas if we are primed with

the dog chased the cat,

we tend to hear

and the cat shinned up the tree,

even though the second half of the sentence would be pronounced identically in the two cases.

It is the reconstruction of a spoken message that gives us such a firm impression that the speech signal consists of a neat
sequence of phonemes: it may indeed be possible to describe speech in this way, but only at a certain stage of processing in
our brains, not at the level of the acoustic signal.

The information used in reconstructing a spoken message can be drawn from many different levels and uses both acoustic
information and the listener's knowledge. We have already noted the existence of phonetic cues, which indicate word
structure, and prosodic cues, which generally indicate sentence structure and point to the location of significant information in the
sentence. In addition to the rules that govern the order in which words can be uttered in the syntax of a language, there are
also agreement rules, such as those between a verb and its subject and between an adjective and the noun it qualifies. Though
limited in English, such rules can provide much disambiguating information in other languages. Gender distinctions, for example,
can operate like a check bit in a coding scheme: the French words boisson (drink) and poisson (fish) are acoustically similar,
but since they differ in gender, the phrases une boisson délicieuse (a delicious drink) and un poisson délicieux (a delicious fish)
are much more distinguishable. A listener might also apply expectations that a sentence should be meaningful and germane to
the situation. Finally, the work with synchronized video recordings demonstrates that in some circumstances optical
information is used in reconstructing the speech message.
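
The check-bit analogy for gender agreement can be made concrete with a small Python sketch. The four-bit codes below are purely hypothetical stand-ins for two acoustically similar words; they are not a phonetic encoding of boisson and poisson. The sketch shows only the coding-theory point: two codewords that differ in a single bit differ in two bits once a parity (check) bit is appended, so they become easier to tell apart.

```python
def hamming(a, b):
    """Number of positions in which two equal-length bit lists differ."""
    return sum(x != y for x, y in zip(a, b))


def with_parity(bits):
    """Append an even-parity check bit to a list of 0/1 values."""
    return bits + [sum(bits) % 2]


# Hypothetical codes for two similar-sounding words (one bit apart).
word_a = [1, 0, 1, 1]
word_b = [0, 0, 1, 1]

if __name__ == "__main__":
    print(hamming(word_a, word_b))                            # 1
    print(hamming(with_parity(word_a), with_parity(word_b)))  # 2
```

In the analogy, the article and adjective endings carrying gender play the role of the appended bit: they add no new content, but they increase the distance between otherwise confusable messages.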

The amount of external cues needed for effective reconstruction depends on the predictability of the message: a mere
grunt might be perceived as Merry Christmas on December 25th; but if, for some reason, one wanted to greet someone in that
way in mid-summer, the words would have to be very clearly articulated.
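
One way to picture this trade-off between predictability and acoustic clarity is as a combination of prior expectation with acoustic evidence. The Python sketch below uses Bayes' rule for that purpose; this framing, the three candidate messages and every probability in it are illustrative assumptions, not a model proposed in the text or measured from listeners.

```python
def posterior(likelihood, prior):
    """Combine acoustic evidence with expectation (Bayes' rule)."""
    joint = {m: likelihood[m] * prior[m] for m in prior}
    total = sum(joint.values())
    return {m: p / total for m, p in joint.items()}


def best(likelihood, prior):
    """Message with the highest posterior probability."""
    post = posterior(likelihood, prior)
    return max(post, key=post.get)


# Hypothetical listener expectations on two different days.
PRIOR_DEC_25 = {"Merry Christmas": 0.90, "Good morning": 0.08, "Excuse me": 0.02}
PRIOR_SUMMER = {"Merry Christmas": 0.001, "Good morning": 0.80, "Excuse me": 0.199}

# Hypothetical acoustic evidence: P(sound heard | message intended).
GRUNT = {"Merry Christmas": 1.0, "Good morning": 1.0, "Excuse me": 1.0}
FAIRLY_CLEAR = {"Merry Christmas": 0.9, "Good morning": 0.05, "Excuse me": 0.05}
VERY_CLEAR = {"Merry Christmas": 0.999, "Good morning": 0.0005, "Excuse me": 0.0005}

if __name__ == "__main__":
    print(best(GRUNT, PRIOR_DEC_25))
    print(best(GRUNT, PRIOR_SUMMER))
    print(best(FAIRLY_CLEAR, PRIOR_SUMMER))
    print(best(VERY_CLEAR, PRIOR_SUMMER))
```

With these made-up numbers an uninformative grunt is enough on December 25th, whereas in mid-summer only the most clearly articulated version overcomes the listener's expectation.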

The list of different sources of information that can be used in decoding a spoken message points up another way in
which speech differs from the teleprinter transmission, namely, the fact that speech has to be regarded as a multilevel sequence.
Thus, words can be thought of as phoneme sequences, while they themselves form part of word sequences making up phrases,
which in turn make up sentences. Evidence needed to understand speech is present at every level, and the evidence at all
levels probably has to be considered simultaneously if the message is to be understood. It is true that we could find much the
same set of levels in a teleprinter transmission of meaningful text, but the levels are not so intimately mixed: in order to decode
the individual teleprinter symbols we do not even need to know what language the text is written in.

Speech is often said to be a redundant signal. It is argued that the same utterance can be understood either when it is
low-pass filtered at 1 kHz or when it is high-pass filtered at 1 kHz, so the information below 1 kHz must be duplicating the
information above that frequency. This reasoning is faulty. The amount of information one needs in a speech signal depends
on how skilled one is at reconstructing the message: much more acoustic information is needed when the topic of the message
is obscure or when a language is being used that is not the native language of the listener, even though all the words and
constructions may be familiar. Native speakers presumably have better information on the relative probabilities of words and
constructions. What may be more important, they also know which constructions are not allowed in the language, while non-native
speakers cannot distinguish between impossible constructions and constructions that are unfamiliar to them but that are
nevertheless possible. As a result, the non-native speaker may waste valuable processing effort on the pursuit of hypotheses
that a native speaker would not even consider, just as chess masters are said not to see bad moves.

There are tradeoffs, then, between the information available in the listener's brain and the information needed in the signal
itself. Speakers apparently take this tradeoff into account and adjust the amount of acoustic information they provide for each
word in their speech in the light of their subconscious estimates of the predictability of these words [7,8]. For instance,
working with phrases in Swedish, Hunnicutt found that the word corresponding to "the letters" excised from the spoken phrase
"During the morning the postman quickly delivered the letters which he had collected during the weekend," where its
occurrence is predictable from the context, is less easily recognized when presented in isolation than when the word is excised
from the following context in which it is less predictable: "Curiously the man examined the letters which he had found."

Speakers, then, do not emit speech messages to be picked up by anyone who cares to listen; they talk to someone.
Although we as yet know too little about speech to be sure about this, it seems likely that a speaker puts just enough cues into
the speech to allow the listener (or imagined listener in the case of, say, a radio broadcast) to be able to comfortably
reconstruct the message from the evidence available. Thus, when we are saying something that is difficult to follow, or when we are
speaking to someone we believe to be foreign, deaf or senile, we supply more phonetic information than we would in a relaxed
conversation with a friend. Elision of phonetic information, such as when we say fish 'n chips, is often ascribed to laziness, but
it can be seen to be a rational strategy for the economical use of a communications link: it would be lazy only if the person at
the other end of the link were obliged to make an unreasonable effort to reconstruct the message. Depending on the
circumstances, overarticulation can be just as inappropriate as underarticulation: it can sound stilted, irritating, even insulting when
the listener feels it to be unnecessary.

To summarize this section: the speech signal is different in nature both from messageless signals such as satellite images
and from machine-generated message-bearing signals like the teleprinter transmission. It is a signal from which a message may
be reconstructed using information drawn from many sources, both information at various levels in the signal itself and
information stored in the mind of the listener. The amount of information that the speaker puts into the signal depends on the
difficulty that he imagines the listener will have in reconstructing the message from it.

5.  Speech  and  Writing 

As a species, we developed speech long before we developed writing. As individuals, we learn to speak before we learn to
write, and speech remains for most of us our primary means of communication with each other.

It may seem surprising, then, that when we think about verbal communication our image is drawn almost entirely from
written communication. But text is literally easier to visualize than speech. Unlike ephemeral speech, text stays on the page to
be examined. At school, our assignments and examinations are overwhelmingly in written form. We become conscious of the
rules governing written language and skilled in applying them. We come to regard everyday spoken communication, if we
think about it at all, as an inferior version of the written language, a version lacking in elegance and littered with errors. At
a conscious level, at least, we tend to ignore those features of spoken language, such as prosody, that are not represented in the
written form.

We saw in the previous section that printed text with its discrete, context-independent letters and words can incite a false
impression of what speech is like. Printed text probably reflects an internal representation of speech after much sophisticated
processing has been applied to it. But the cultural importance of printed text has meant that the properties of text have been
projected back onto speech, reinforcing the belief that our internal impression of the speech signal corresponds to an external
reality.

A similar phenomenon occurs when we think about the style of language appropriate for speech. Yet even though the
formal rules of grammar underlying the two modes of communication are generally thought to be the same, the styles of language
appropriate for writing and speaking are different. In terms of these formal rules, spoken language is more errorful, partly
because we have much less time to plan and polish our spontaneous speech than we have our writing, though many so-called
errors in speech may actually be observances of different rules. Certainly, spontaneous speech with all its apparent
ungrammaticality, redundancies, hesitations and incomplete constructions is usually easier to follow than text in written style being read
aloud.

Papers delivered at conferences are often all but impossible to follow because the presenter is reading a text written in a
style appropriate for a reader but not for a listener. When the presenter departs from his text to comment on a slide or to
answer a question he generally becomes much easier to follow, even though his language might appear to be less well formed.
Speakers are sometimes tempted to read a prepared text because they believe it allows them to pack more information into a
limited time. It does indeed allow them to transmit more information, but it does not allow their audience to receive more
information. The rate at which we transmit information in a well planned talk without a prepared text probably corresponds to
the rate at which the audience can absorb it.

Until recently, an example of the inappropriateness of written style in speech could be heard regularly on a U.S. television
network when a sponsoring corporation described itself as:

"providing high-technology, computer-based systems solutions to the complex problems of business, government and
defense."

Admittedly, this is not the snappiest of sentences even as text, but when spoken it is particularly indigestible. Participial
phrases like this one are not common in speech. This example contains a large proportion of long, relatively rare, ostensibly
high information-content words, while real speech contains more short, common words. Many of these long words act as
qualifiers piled up before 'solutions' to an extent that strains our auditory ability to hang on until the noun comes along. When
we read the same sentence this problem does not arise.

A similar piling up of subsidiary information that seems to be unacceptable in speech but common in writing occurs when
a main clause is preceded by a subordinate qualifying clause starting with although, while or since. The rarity of such
constructions in speech is presumably due to the strain they put on our ability to wait for the subject of the main clause.

The words commonly used to link or separate ideas in spontaneous speech are generally different from those used in text.
Words like moreover, however, nevertheless, thus, therefore, consequently, and many others are common in text but rare in
speech. Simple link words like and, but and so are more common in speech than in text, and a further set of link words like
well, O.K., right, look, besides and anyway, are common in speech but rare in text.

Partly because text fails to reproduce many of the cues supplied in speech, faithfully transcribed spontaneous spoken
dialogue is all but incomprehensible. Stubbs [9] provides a transcription of a conversation between himself and two schoolboys,
which (though he assures us it was well ordered and comprehensible to a listener at the time) is almost impossible to
follow as a text.

Chapanis [10] has described an experiment that allowed direct comparison between spontaneous speech and written
communication. Pairs of subjects who were physically separated from each other were required to carry out a task
needing their active cooperation. Performance on the task was compared when various channels of communication were
provided. These included a speech link, a teleprinter, a means of passing handwritten notes, and a video link,
together with various combinations of these channels. Subjects were able to complete the tasks roughly twice as
quickly with any combination that included voice as they could with any combination that did not include voice.
Transcriptions of voice communications showed, as one might expect, a high proportion of mal-formed or incomplete
sentences, but so did the written communications in those cases where written messages could be passed back and
forth without any delay. It seems that failure to observe the formal rules of grammar may be a feature of any
spontaneous dialogue rather than specifically of spoken dialogue. Subjects used about five times as many words to
solve the same task using a voice link as they did using a text link, but they delivered the spoken words ten times
faster than the written ones.

If humans optimise their use of any communications channel, then these results, and the differences in spoken and
written style noted earlier, are consistent with the idea that writing or reading a single word is relatively more
expensive in time or effort compared with speaking or hearing a word, but recognising spoken words is a more
uncertain process than recognising written words. In particular, the occurrence of unusual or unlikely words or of
complex sentence structures poses more of a problem for a listener than for a reader. The additional uncertainty in
decoding speech has thus to be countered by using more words and by using simpler words and simpler constructions;
but this need to use more words is offset by the greater speed with which spoken words can be delivered.

Printed text is either legible or illegible. On the other hand, when we first hear a speaker we have to adapt to the
peculiarities of the voice. If the speech is unusual, if the speaker has a foreign accent, for example, or if the
acoustic signal is degraded, as it is on the telephone, there is a noticeable period over which we need to be
gathering information on the characteristics of the speech we are hearing. Consequently, the most effective speech
to use at this point is the most predictable speech, since anything that is not predictable will not be understood
and will be less useful in supplying the information that the listener needs. The influence of text, however, may
lead us to ignore the need for this adaptation period. For example, the Office de la langue française of the
Government of Quebec produces a booklet [11] giving advice on the use of the telephone in French. They recommend
that private individuals should answer the phone by saying just Allô, rather than Allô j'écoute. In the case of the
formula to be used by receptionists or telephonists in answering the phone on behalf of an organization, they
recommend giving just the name of the organization. In both cases, they assert that adding bonjour

"is not only useless but also incorrect." [translation].

This advice reflects influence from written language, where good style requires the number of words used to be
minimized, as opposed to spoken language, where additional words cost little in time or effort and where predictable
words like bonjour or j'écoute can provide useful information on channel characteristics at the beginning of an
exchange. Instead of forbidding the use of bonjour, the Office could help communication by encouraging its use and
by suggesting that it should be spoken before the name of the organization rather than after it, since it is the
most predictable word possible.

Speech output systems sometimes seem to have been designed as though they were to generate written rather than
spoken messages. An example is provided by a system intended to allow passengers with a local bus company to find
out by telephone when the next bus is due. Each stop has a unique number that corresponds to a telephone number.
Dialing that number causes a computer-controlled speech output system to generate a message concerning the expected
arrival of buses at that stop. On dialing a bus-stop number a passenger hears a message of the form:

"[Bus company] schedule for stop 8342. Route 3 in 5 and 25 minutes. Route 57 in 13 and 38 minutes. Thank you."

We have seen that more words are needed in spoken messages than in equivalent written messages if the speech is to
be easily understood, but this style is even more terse than that of normal text. It is the style used in telegrams
and telexes, where every extra word adds significantly to the cost. When spoken, it is hard to follow.

Since this is telephone speech, and since it is in addition peculiar, machine-generated speech, it is particularly
important to give the listener a chance at the beginning of the message to adjust to the voice and to the channel.
The first sentence in this message, even though it contains little useful information, is not predictable enough to
allow the adaptation to take place. New users, unfamiliar with the stop number, will hear an apparently random
sequence of digits, which is as unpredictable as anything that occurs in speech.

A more comprehensible message might be:

"Good morning. The next buses on Route 3 are due in 5 minutes and in 25 minutes; and on Route 57, they're due in 13
minutes and in 38 minutes. This information is for stop number 8342. Thank you, good-bye."

This alternative version contains about twice as many words as the original, so it might seem to be open to the
objection that each enquiry would take twice as long. In fact, the stilted style of the original message forces a
slow delivery and consequently makes the durations of the two messages comparable.

The same bus timetable enquiry system also provides an example of the pernicious effects at the acoustic level of
projecting properties of speech onto text. It seems that words that might vary from message to message
(destinations, times, etc.) were recorded in isolation and are then concatenated to form the message. Worse still,
syllables such as -teen that are common to several words were recorded in isolation. If words were like that, this
would of course be a reasonable thing to do, but we have seen that words are affected by their context and by their
function in the sentence. The 3 in 8342 does not sound like the 3 in Route 3 in normal speech. The effect of
recording the words in isolation is to destroy the prosodic cues to sentence structure and to give each word
prosodic cues corresponding to strong emphasis. It also affects the phonetic content to some extent. For example,
the word and when used in fluent speech has a centralized vowel or often no vowel at all, something more like 'nd or
even 'n. The kind of and produced in isolation is rare in normal speech: it occurs only when the speaker wants to
stress that important additional information is being added, as in:

"Sudio gets your dishes clean and it's kind to your hands"

It might seem that the emphasised version of the word would be easier to recognise in any context, since it contains
clearer phonetic information, but this information is misleading when the listener is trying to understand the
structure of the whole sentence. The reduction or deletion of the vowel in a normal and provides the useful
information that this occurrence of the word does not carry emphasis.

As we noted in the previous section, in a message made from concatenated isolated words any single word is perfectly
clear but the message as a whole is not. In the case of the bus schedule system, even if users can manage to
identify every word despite the confusing prosodic information, they generally find it difficult to retain the
information about arrival times they were seeking.

In summary, the style used in spontaneous speech is different from that used in writing, and the differences do not
arise solely from speech being less well planned than text. They reflect the different characteristics of the two
modes of communication. Features present in speech but absent in text are often ignored.


Summary

In order to approach the speech signal for the purposes of automatic recognition, efficient transmission or
synthetic generation, it is useful to take into account how humans generate it and how they perceive it. Speech is a
message-bearing signal, but it is more complex than artificial message-bearing signals, containing no clearly
identifiable context-independent units. Speech communication is an interactive process in which the listener
actively reconstructs a message from acoustic cues and the speaker estimates the amount of acoustic information
necessary for the task. Printed text and speech both convey verbal messages, but there are large differences between
them. In particular, the styles appropriate for the two modes of communication are quite different.


SUMMARY

Recent advances in algorithms and techniques for speech coding now permit high quality voice reproduction at
remarkably low bit rates. The advent of powerful single-chip signal processors has made it cost effective to
implement these new and sophisticated speech coding algorithms for many important applications in voice
communication and storage. This paper reviews some of the main ideas underlying the algorithms of major interest
today. The concept of removing redundancy by linear prediction is reviewed, first in the context of predictive
quantization or DPCM. Then linear predictive coding, adaptive predictive coding, and vector quantization are
discussed. The concepts of excitation coding via analysis-by-synthesis, vector sum excitation codebooks, and
adaptive postfiltering are explained. The main ideas of Vector Excitation Coding (VXC) or Code Excited Linear
Prediction (CELP) are presented. Finally, low-delay VXC coding and phonetic segmentation for VXC are described.

INTRODUCTION 

Speech is the communication mechanism that distinguishes humans from lower animal forms and is an essential part of
what allows man to function in civilization: our sophisticated ability to use language and communicate directly with
one another via an acoustic channel. With the invention of the telephone by A.G. Bell, a major advance in human
communication took place. Now we can communicate "in real-time" (not by writing letters or sending telegrams) with
one another while geographically separated, perhaps around the world or in an aircraft or space vehicle. Of course
the telephone was until recently based on analog communication: a simple modulation of an electric current in
proportion to the instantaneous intensity of an acoustic signal. In recent decades digital communications emerged as
a new and prevalent technology and allowed us to develop highways and superhighways carrying a variety of traffic
such as data, video, and multiple channels of voice with greater reliability, cost effectiveness, privacy and
security, and over hostile channels (spread spectrum methods) and troublesome radio channels.

With the advent of rapidly advancing digital signal processing technology, it has recently become cost effective to
use rather sophisticated speech coding algorithms in numerous commercial, government, and military communications
applications. Speech coding is also used, or is coming into wide use, in many storage applications, where the
purpose is to transport voice not from one geographical location to another but from one point in time to a later
point in time.

In this paper, we first describe some of the current and emerging applications of speech coding. Then we lead into
the description of the main algorithms of interest today by starting with the basic ideas of predictive
quantization, DPCM, LPC vocoders, and APC coders. Next, we introduce the idea of vector quantization, then come to
excitation coding and coders based on analysis-by-synthesis coding and focus particularly on CELP or VXC type
coders. Some recent developments of importance, vector sum excitation codebooks, low-delay VXC, and adaptive
postfiltering, are reviewed. Following this we introduce the use of phonetic segmentation in speech coding, a new
approach that may contribute to the next generation of speech coders.

APPLICATIONS 

Applications of speech coding today have become very numerous. A few examples are listed here: mobile satellite
communications, cellular mobile radio, voice/data multiplexers for public and private networks, rural telephone
radio carrier systems, audio for videophones or video teleconferencing systems, voice messaging networks, universal
cordless telephones, audio/graphics conferencing, digital circuit multiplexing equipment (DCME), voice memo wrist
watches, voice logging recorders, and interactive PC software. New applications continue to emerge as digital signal
processing technology makes very efficient compression increasingly cost effective.

Fig. 1 Examples of Speech Waveforms

BASICS OF SPEECH CODING

The signals shown in Fig. 1 illustrate the great variety in the character of speech waveforms. Sometimes they are
periodic or almost periodic, at other times a mixture of periodic and random-like signals, and sometimes the
waveform appears like random noise. Shown in the figure is a 10 ms time interval. A speech coder operating, for
example, at 4 kb/s must be able to describe any such 10 ms segment (80 samples) using only 40 binary digits in such
a way that the segment will be reproduced with an accuracy sufficient to ensure that it will sound very close to the
original. Unlike PCM, where 8 bits are used to code each sample, in such a low bit rate coder we have only 1/2 of a
bit available per sample to describe the sound or the waveform. Of course there is no way to adequately describe the
amplitude of a sample, even if an entire bit were available per sample (as in the case of an 8 kb/s coder). Thus we
must use clever techniques to exploit redundancy across samples by introducing memory in the encoding process, so
that we don't merely examine one sample at a time and code that sample (as in PCM), but we store up past samples,
and/or information obtained from past samples, to send out essential digital information that will help us to
specify the current sample.
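
The bit budget quoted above follows from simple arithmetic, which the short sketch below reproduces. It assumes the
8,000 samples/s telephone sampling rate used throughout this paper; the function names are ours and purely
illustrative.

```python
def bits_per_sample(bit_rate_bps, sample_rate_hz=8000):
    """Average number of bits the coder can spend on each sample."""
    return bit_rate_bps / sample_rate_hz


def segment_budget(bit_rate_bps, segment_ms, sample_rate_hz=8000):
    """Return (samples, bits) available for a segment of the given duration."""
    samples = sample_rate_hz * segment_ms / 1000
    bits = bit_rate_bps * segment_ms / 1000
    return samples, bits


if __name__ == "__main__":
    # 64 kb/s corresponds to PCM at 8 bits per sample and 8,000 samples/s.
    for label, rate in [("64 kb/s PCM", 64000), ("8 kb/s", 8000), ("4 kb/s", 4000)]:
        samples, bits = segment_budget(rate, 10)
        print(f"{label}: {samples:.0f} samples and {bits:.0f} bits per 10 ms, "
              f"{bits_per_sample(rate):.2f} bits/sample")
```

For the 4 kb/s case this gives 80 samples and 40 bits per 10 ms segment, i.e. half a bit per sample.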

Speech coders have been traditionally grouped into vocoders (from "voice coders") and waveform coders. Today this
dichotomy has become blurred with the current generation of so-called hybrid coders which embody some aspects of
both of the above categories. Hybrid coders do attempt to reproduce the waveform, to some degree, while also
describing key parameters that help to reproduce (synthesize) a sound perceptually similar to the original.

We assume the reader is familiar with PCM which, as used in telephony today, samples voice at 8,000 samples/s and
codes each sample with an 8 bit word using a nonuniform quantizer based roughly on a logarithmic companding
characteristic. Nothing further will be mentioned about this. Suffice it to note that a quantizer can be viewed as
the cascade of an encoder (A/D converter) and a decoder (D/A converter). The encoder generates an index as a binary
word specifying the amplitude level of the quantized value which approximates the input amplitude. Often the
quantizer is viewed as a black box that generates both the index and the quantized level. The decoder (D/A)
sometimes is called an 'inverse' quantizer and it simply maps the index into the reproduced level.
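
The encoder/decoder view of a quantizer can be made concrete with a small sketch. The code below assumes the
continuous mu-law companding formula with mu = 255 and inputs normalized to [-1, 1]; this is a textbook idealization
chosen for illustration, not the segmented tables of the telephony standard, and all names are ours.

```python
import math

MU = 255.0          # companding constant (assumed value for illustration)
LEVELS = 2 ** 8     # 8 bit index, as in telephone PCM


def compress(x):
    """Logarithmic compression of an amplitude in [-1, 1]."""
    return math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)


def expand(y):
    """Inverse of compress()."""
    return math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)


def encode(x):
    """Encoder (A/D side): map an amplitude to an integer index 0..255."""
    x = max(-1.0, min(1.0, x))
    index = int((compress(x) + 1.0) / 2.0 * LEVELS)  # uniform cells after compression
    return min(index, LEVELS - 1)


def decode(index):
    """Decoder (D/A side, the 'inverse' quantizer): map an index to its level."""
    y = (index + 0.5) / LEVELS * 2.0 - 1.0           # midpoint of the cell
    return expand(y)


def quantize(x):
    """Black-box view: return both the index and the quantized level."""
    index = encode(x)
    return index, decode(index)
```

Because the cells are uniform only after compression, the reproduction levels are closely spaced near zero and
widely spaced near full scale, which is the intended effect of the logarithmic characteristic.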


PREDICTIVE  QUANTIZATION 

A major advance in waveform coding of speech was the introduction of predictive quantization. Fig. 2 shows the basic
idea of this scheme. First, note that a quantity $\tilde{X}_k$ is subtracted from the input sample $X_k$, forming a
difference sample $d_k$. This difference is quantized and then the quantity $\tilde{X}_k$ is added back to the
quantized approximation of the difference sample $d_k$, producing a final output $\hat{X}_k$. Without giving any
attention to how $\tilde{X}_k$ is generated, it is evident that the error in approximating the input sample $X_k$ by
$\hat{X}_k$ is exactly equal to the error incurred by the quantizer in approximating the difference signal. This
means that if we can somehow make $\tilde{X}_k$ very close to $X_k$, the difference signal will be small, and fewer
bits will be needed for quantizing $d_k$ so as to make the overall error in approximating $X_k$ by $\hat{X}_k$ also
small. The quantity $\tilde{X}_k$ is chosen to be a linear prediction of $X_k$ based on previously reproduced
samples. The predictor has transfer function

$$P(z) = \sum_{j=1}^{M} a_j z^{-j}$$

where $M$ is the order of the predictor and the $a_j$ are its coefficients. The difference between the input sample
and its predicted value (based on the past information known to the decoder) is quantized and the index specifying
the quantized level of this difference is sent to the decoder. Note that the encoder contains a copy of the decoder.
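
A minimal sketch of this feedback structure may help. It assumes a fixed first-order predictor (a single
coefficient `a`) and a uniform quantizer with an unbounded integer index; both are simplifications chosen for
illustration rather than the parameters of any particular coder.

```python
import math


def quantize(d, step):
    """Uniform quantizer: return (index, reproduced level)."""
    index = round(d / step)
    return index, index * step


def dpcm_encode(x, a=0.9, step=0.05):
    """Return the transmitted indices and the encoder's local reconstruction."""
    indices, recon = [], []
    prev = 0.0                          # previous reconstructed sample
    for xk in x:
        pred = a * prev                 # prediction from reproduced samples only
        idx, dq = quantize(xk - pred, step)
        prev = pred + dq                # reconstructed sample
        indices.append(idx)
        recon.append(prev)
    return indices, recon


def dpcm_decode(indices, a=0.9, step=0.05):
    """Decoder: inverse quantizer followed by the synthesis recursion."""
    out, prev = [], 0.0
    for idx in indices:
        prev = a * prev + idx * step
        out.append(prev)
    return out


if __name__ == "__main__":
    x = [math.sin(2 * math.pi * k / 40) for k in range(200)]
    indices, recon = dpcm_encode(x)
    print(dpcm_decode(indices) == recon)                 # decoder tracks the encoder
    print(max(abs(p - q) for p, q in zip(x, recon)))     # bounded by step / 2
```

Because the prediction is formed from reconstructed samples, the decoder can repeat it exactly, and the overall
error never exceeds the error of the quantizer itself (half a step here).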


Fig. 4 Predictive Quantization (DPCM) with Pole-Zero Prediction

of poles and zeros were used and if the synthesis filter is adaptive (and thus time varying) to track the changing
shape of the vocal tract. The CCITT 32 kbit/s ADPCM standard, based on this structure, has 6 zeros and 2 poles and
performs backward adaptation to make the two predictors track the time-varying statistics of the speech.

LPC VOCODER

In an entirely different approach to speech coding, known as parameter coding, analysis/synthesis coding, or
vocoding, no attempt is made at reproducing the exact speech waveform at the receiver, only a signal perceptually
equivalent to it. Early versions of this approach included formant synthesizers and so-called "terminal analog
synthesizers". However, the most widely used form today was partly motivated by recognizing the DPCM decoder as a
model of the speech production mechanism. The idea is to replace the quantized difference signal by a simple
excitation signal which at least crudely mimics typical excitation signals generated in the human glottis.

Figure 5 illustrates the decoder structure of an LPC vocoder. (LPC stands for Linear Predictive Coding.) The encoder
sends a very modest number of bits to the decoder to describe each successive frame of the speech to be synthesized.
A frame is a time segment typically 20 to 23 ms long. The excitation is specified by a one bit voicing parameter
which indicates whether the frame of speech is judged to be periodic or aperiodic. Periodic segments correspond to
so-called voiced speech where the glottis periodically opens and closes, producing a fairly regular train of pitch
pulses to the vocal tract. If the frame is voiced, the encoder also sends an estimate of the pitch period, which
typically ranges from 3 to 18 ms. The decoder locally generates one of two excitations: a periodic train of impulses
at the pitch period, or (for unvoiced frames) a random noise excitation signal. A gain value must also be
transmitted to specify the correct energy level of the current frame. Thus the parameters specified for the
synthesis filter in each frame are: voicing decision, pitch (if appropriate), LPC coefficients (typically 10) and
gain. The encoder of an LPC vocoder, shown in Fig. 6, performs computations on each frame of input speech to
determine the set of parameters needed by the decoder.

Fig. 5 LPC Vocoder - Decoder
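
The decoder just described can be sketched in a few lines. The code below is an illustration under simplifying
assumptions: it uses two predictor coefficients rather than the typical ten, restarts the pulse train at the
beginning of every frame, and applies the gain directly to the excitation. The function names and argument layout
are ours.

```python
import random


def make_excitation(voiced, pitch_period, n, rng):
    """Impulse train at the pitch period (voiced) or Gaussian noise (unvoiced)."""
    if voiced:
        return [1.0 if k % pitch_period == 0 else 0.0 for k in range(n)]
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def synthesize_frame(lpc, gain, voiced, pitch_period, n, history, rng):
    """All-pole synthesis: s[k] = gain * e[k] + sum_i lpc[i] * s[k - 1 - i].

    `history` holds the most recent outputs, newest first, and must have the
    same length as `lpc`. Returns the frame and the updated history.
    """
    out = []
    for e in make_excitation(voiced, pitch_period, n, rng):
        s = gain * e + sum(a * h for a, h in zip(lpc, history))
        history = [s] + history[:-1]
        out.append(s)
    return out, history


if __name__ == "__main__":
    rng = random.Random(0)
    lpc = [1.2, -0.72]          # a stable two-pole example filter (assumed values)
    state = [0.0, 0.0]
    # One voiced 20 ms frame (160 samples at 8 kHz) with an 8 ms pitch period,
    # followed by one unvoiced frame that reuses the filter memory.
    voiced, state = synthesize_frame(lpc, 1.0, True, 64, 160, state, rng)
    unvoiced, state = synthesize_frame(lpc, 0.3, False, 0, 160, state, rng)
    print(voiced[:4], len(unvoiced))
```

Carrying the filter memory from one frame to the next avoids a discontinuity at each frame boundary when the
parameters change.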

The linear predictor described here and in the context of DPCM is often called a short-term predictor or formant
predictor. For later convenience we denote the short-term predictor by $P_s(z)$, where $s$ indicates short. These
names illustrate the fact that the predictor exploits the short-term correlation in nearby samples of the speech
waveform, and the fact that it is the short-term correlation which characterizes the formants dominating the
envelope of the speech spectrum. Generally three or four principal formants are evident in examining the Fourier
transform of a speech frame. The formant synthesis filter has a frequency response whose magnitude closely
corresponds to the envelope of the spectrum. The transfer function of this synthesis filter is
$[1 - P_s(z)]^{-1}$.

Note that the vocoder scheme does not actually attempt to encode the speech waveform but only extracts some
parameters or features that partially characterize each frame. Thus it does not have the ability to reproduce an
approximation to the original waveform. Nevertheless, it can synthesize clear, intelligible speech at the very low
bit-rate of 2400 b/s. Such vocoders have served for years as the underlying technology for secure voice terminals,
which include the functions of

Fig. 2 Predictive Quantization

The decoder replicates the feedback loop of the encoder. Note that the linear predictor now appears embedded in a
feedback loop. The decoder is simply an inverse quantizer which reproduces the sequence of quantized difference
samples and feeds it into a filter with transfer function $[1 - P(z)]^{-1}$, called the synthesis filter, to
reproduce samples of the original signal $X_k$.

The performance gain of this structure is due to the prediction gain of the predictor, i.e. the ratio of the
variance of $X_k$ to the variance of $d_k$, or the factor by which the power of the input signal is reduced once its
predictable part has been removed. This prediction gain in dB is what determines the performance improvement over
straight PCM.
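
As a rough numerical illustration, the sketch below estimates the prediction gain of a first-order predictor on a
synthetic first-order autoregressive signal. Both the test signal and the open-loop measurement (prediction from
unquantized past samples) are assumptions made for simplicity; in the coder itself the prediction is formed from
reconstructed samples.

```python
import math
import random


def variance(v):
    mean = sum(v) / len(v)
    return sum((s - mean) ** 2 for s in v) / len(v)


def prediction_gain_db(x, a):
    """Gain of the predictor a * x[k-1], in dB, measured without a quantizer."""
    d = [x[k] - a * x[k - 1] for k in range(1, len(x))]
    return 10.0 * math.log10(variance(x[1:]) / variance(d))


def ar1_signal(n, rho, seed=0):
    """Synthetic correlated signal x[k] = rho * x[k-1] + white Gaussian noise."""
    rng = random.Random(seed)
    x = [0.0]
    for _ in range(n - 1):
        x.append(rho * x[-1] + rng.gauss(0.0, 1.0))
    return x


if __name__ == "__main__":
    x = ar1_signal(20000, 0.9)
    print(f"prediction gain: {prediction_gain_db(x, 0.9):.2f} dB")
```

For this signal the theoretical gain is 1/(1 - 0.81), about 7.2 dB. Using the common rule of thumb of roughly 6 dB
of signal-to-noise ratio per quantizer bit, that is worth a little more than one bit per sample.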

Fig. 3 shows the block diagram of a DPCM coder in a more conventional form, which is exactly the same scheme as in
Fig. 2, only drawn in a less insightful way. By comparing the two figures, it is easily verified that they represent
identical coders.

It is interesting to note that the DPCM decoder, which generates speech from a sequence of difference samples,
models, in a primitive sense, the speech production mechanism in humans. The synthesis filter can be viewed as a
model of the human vocal tract and the difference signal as a model of the acoustic excitation signal produced at
the glottis. If the order of the predictor polynomial is reasonably high (8 or higher), this synthesis filter indeed
has a frequency response that reasonably corresponds to the overall filtering characteristics of the human vocal
tract with its distinct spectral peaks, known as formants.


Of course, the human vocal tract is in constant movement and thus its frequency response varies substantially in
time, from one phonetic sound unit, or phoneme, to another. Only over a time interval of a few milliseconds is it
likely to be more or less constant. In Adaptive DPCM (ADPCM), the predictor is also time varying and thereby has a
greater ability to model the speech production mechanism.

Another improvement in DPCM is the use of pole-zero prediction. Fig. 4 shows the predictive quantization structure
modified by the use of two predictors $P_1(z)$ and $P_2(z)$. Each takes a linear combination of past values from its
input. The new predictor, $P_2$, is applied directly to the quantized difference samples, while $P_1$ combines these
with the preceding values of $\hat{X}_k$ to produce the current value of $\tilde{X}_k$. Note that the corresponding
decoder structure, also shown in Fig. 4, has a pole-zero synthesis filter, where $P_2$ contributes zeros and $P_1$
poles to the synthesis filter.
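
One way to see which predictor contributes zeros and which contributes poles is to write the decoder recursion in
the z-domain. The following is a sketch under the assumption that both predictors are strictly causal; the symbols
$e_k$ (quantized difference sample), $a_i$, $b_j$ and the orders $N_1$, $N_2$ are notation introduced here for
illustration and are not taken from the figure.

$$
\begin{aligned}
\tilde{X}_k &= \sum_{i=1}^{N_1} a_i \hat{X}_{k-i} + \sum_{j=1}^{N_2} b_j e_{k-j}, \qquad
\hat{X}_k = \tilde{X}_k + e_k, \\
\hat{X}(z) &= \frac{1 + P_2(z)}{1 - P_1(z)}\, E(z), \qquad
P_1(z) = \sum_{i=1}^{N_1} a_i z^{-i}, \quad P_2(z) = \sum_{j=1}^{N_2} b_j z^{-j}.
\end{aligned}
$$

The numerator $1 + P_2(z)$ supplies the zeros of the synthesis filter and the denominator $1 - P_1(z)$ its poles.
Setting $P_2(z) = 0$ recovers the all-pole synthesis filter $[1 - P(z)]^{-1}$ of ordinary DPCM.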

Indeed, the pole-zero filter may also provide a more versatile model of the human vocal tract if a suitable number

Fig. 6 LPC Vocoder - Encoder

encrypting a bit stream and digital modulation into an analog voiceband signal suitable for transmission over an
analog telephone connection.


PITCH  PREDICTION 

Another fundamental technique that has had a major impact on speech coding is the use of long-term or "pitch"
prediction. The periodic, or nearly periodic, character of a speech segment suggests that there is considerable
redundancy that can be exploited by predicting current samples from samples observed one period earlier. Because
this periodicity is closely associated with the so-called fundamental frequency or pitch of voiced speech, the
number of glottal openings per second, the repetition period is often called the pitch period. A long-term predictor
or "pitch predictor" can be directly used to remove the periodicity when the period is known. The phrase "long-term"
refers to the relatively large delay (many samples) used in pitch prediction compared to the small values for the
short-term predictor. Thus, a pitch predictor typically has the transfer function

$$P_l(z) = \sum_{i=-I}^{I} a_i z^{-(m+i)}$$

where $m$ is the pitch period measured in samples, $I$ is a small integer, and $a_i$ are coefficients. Often a
single tap predictor is used, so that $I = 0$. The filter structure with transfer function $1 - P_l(z)$ removes
periodicity, and thereby redundancy, by subtracting the predicted value from the current sample. This gives rise to
a pitch synthesis filter, with the inverse transfer function $[1 - P_l(z)]^{-1}$, which introduces a periodic
character to an aperiodic input. We shall see how the pitch synthesis filter or long-term synthesis filter will play
an important role in the new generation of speech coders.
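
For the single tap case the two filters reduce to one-line recursions, sketched below. The function names are ours,
samples before the start of the buffer are taken to be zero, and the test signal is an artificial periodic waveform
rather than real speech.

```python
import math


def pitch_analysis_filter(x, m, a):
    """Apply 1 - P_l(z) with a single tap: r[k] = x[k] - a * x[k - m]."""
    return [x[k] - (a * x[k - m] if k >= m else 0.0) for k in range(len(x))]


def pitch_synthesis_filter(r, m, a):
    """Apply 1 / (1 - P_l(z)) with a single tap: y[k] = r[k] + a * y[k - m]."""
    y = []
    for k, rk in enumerate(r):
        y.append(rk + (a * y[k - m] if k >= m else 0.0))
    return y


if __name__ == "__main__":
    period = 40
    x = [math.sin(2 * math.pi * k / period) + 0.5 * math.sin(4 * math.pi * k / period)
         for k in range(200)]
    residual = pitch_analysis_filter(x, period, 1.0)
    print(max(abs(v) for v in residual[period:]))      # periodicity removed
    pulses = pitch_synthesis_filter([1.0] + [0.0] * 99, period, 0.8)
    print(pulses[0], pulses[40], pulses[80])           # decaying periodic pulses
```

The first result shows that an exactly periodic input is cancelled after the first period; the second shows the
synthesis filter turning a single impulse into a decaying train of pulses spaced one pitch period apart.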

The computation of the pitch predictor parameters, i.e. the pitch period and predictor coefficients, can be
performed by the encoder in a manner similar to that used for LPC analysis, where the buffered input speech is used
to compute the predictor parameters. This is called an open-loop pitch analysis, in contrast with a more recent
method, to be described later, which optimizes the pitch predictor by directly measuring its impact on the overall
quality of the speech reproduced by the decoder.

Fig. 7 Adaptive Predictive Coder

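An open-loop analysis of this kind can be sketched as a search over candidate lags. The version below is one simple
possibility, not necessarily the procedure used in any particular coder: for a single tap predictor it picks the lag
and coefficient that minimize the energy of the prediction residual over the buffered frame. The lag range, frame
length and function name are assumptions chosen for illustration.

```python
import math


def open_loop_pitch(frame, min_lag, max_lag):
    """Return (lag, coefficient) minimizing the energy of x[k] - a * x[k - lag].

    All lags are compared over the same window, k = max_lag .. len(frame) - 1.
    For a given lag the best coefficient is cross / energy, and the residual
    energy is reduced by cross**2 / energy, so the search maximizes that score.
    """
    best_lag, best_coeff, best_score = min_lag, 0.0, 0.0
    window = range(max_lag, len(frame))
    for lag in range(min_lag, max_lag + 1):
        cross = sum(frame[k] * frame[k - lag] for k in window)
        energy = sum(frame[k - lag] ** 2 for k in window)
        if energy > 0.0 and cross > 0.0:
            score = cross * cross / energy
            if score > best_score:
                best_lag, best_coeff, best_score = lag, cross / energy, score
    return best_lag, best_coeff


if __name__ == "__main__":
    period = 57     # samples; about 7 ms at 8,000 samples/s
    frame = [math.sin(2 * math.pi * k / period)
             + 0.5 * math.sin(4 * math.pi * k / period + 1.0) for k in range(320)]
    print(open_loop_pitch(frame, 24, 100))
```

On real speech such a search can lock onto a multiple of the true period, so practical analyzers usually add
safeguards that are omitted here.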

ADAPTIVE  PREDICTIVE  CODING  (APC) 

The oldest waveform coding technique which makes use of pitch prediction can be viewed as a sophisticated version of
ADPCM. One version of an APC encoder is shown in Fig. 7. It clearly resembles the predictive coder of Fig. 2. In
fact, the main difference in this structure is the addition of a pitch predictor to further remove redundancy from
the input samples prior to quantization. In this scheme, we subtract from the input sample a short-term prediction
$\tilde{X}_k$ and then subtract a long-term prediction to produce a difference signal $d_k$ that has very little
redundancy compared to the original sequence of speech samples $X_k$. Note that with this structure, the exact same
prediction values are added back to the quantized difference signal so that we have, as in DPCM, the property that
the overall error between the original speech and the reconstructed speech $\hat{X}_k$ is equal to the quantization
error.

A crucial distinction between APC and DPCM, not indicated in the figure, is that the short- and long-term predictors
are updated for every frame, by directly computing the necessary parameters from a frame of speech stored in an
input buffer prior to being encoded. This implies that side information describing the predictor parameters must be
multiplexed with the bits produced by the quantizer to specify the difference signal, often called the prediction
residual signal. In fact, in typical APC coders a rather low bit-rate is found to be adequate to code the residual
signal.

The decoder for this APC scheme is also shown in Fig. 7 and it is evident that it reproduces the same sample sequence $\hat{x}_k$
as generated in the encoder.

What is most noteworthy about the decoder structure is that the speech is being regenerated or synthesized by applying a
signal q to a cascade of two synthesis filters. If a reasonably good job was done in determining prediction parameters and
updating them at a reasonably frequent rate, e.g., a frame rate of 20 ms, it is found that this signal is very closely described
as white Gaussian noise. Thus in effect, we are synthesizing speech from a time-varying speech production filter by
applying to it a particular white noise excitation. This paradigm will recur in subsequent discussions.

Various enhancements of APC have been developed, and in particular, quantisation of the residual combined with entropy
coding is often used. The APC structure can be modified by interchanging the role of long- and short-term prediction.
APC speech coders were implemented and used in the 1970s at typical bit rates of 9.6 kb/s and 16 kb/s. In the past
decade, however, interest in APC has gradually diminished due to the emergence of newer and more powerful speech coding
methods.


VECTOR  QUANTIZATION 

It has become recognized in the past decade that the efficient coding of a vector, an ordered set of signal samples or
parameter values describing a signal, can be achieved by pre-storing a codebook of predesigned code vectors. For a given
input vector, the encoder then simply identifies the address, or index, of the best matching code vector. Note that this is in
essence a pattern matching algorithm. The index, as a binary word, is then transmitted and the decoder replicates the
corresponding code vector by a table-lookup from a copy of the same codebook. In this way, the vector components are
not coded individually as in scalar quantization, but rather all at once. Considerable efficiency is achieved, fractional bit
rates (bits per vector component) become possible, and the average distortion (i.e., average squared error per component)
for a given bit rate is much reduced. Fig. 8 illustrates the basic idea of vector quantization (VQ).
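The encode and decode steps just described can be sketched in a few lines. This is a minimal illustration assuming a squared-error match and a tiny hand-made codebook; the function names and the codebook contents are hypothetical.

```python
import math


def vq_encode(vector, codebook):
    """Return the index of the code vector with the smallest squared error."""
    def sq_err(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_err(codebook[i]))


def vq_decode(index, codebook):
    """Table lookup in the decoder's copy of the same codebook."""
    return list(codebook[index])


def bits_per_component(codebook):
    """Bit rate in bits per vector component for a fixed-length index."""
    return math.log2(len(codebook)) / len(codebook[0])


# Illustrative codebook: 4 code vectors of dimension 4.
codebook = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [1.0, -1.0, 1.0, -1.0],
    [-1.0, -1.0, -1.0, -1.0],
]
index = vq_encode([0.9, -1.2, 1.1, -0.8], codebook)
reproduction = vq_decode(index, codebook)
```

With four code vectors of dimension four, the index costs 2 bits per vector, i.e. half a bit per component, which is the kind of fractional rate that scalar quantization cannot reach.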


Fig. 8 Vector Quantization


The first major application of VQ to speech coding was reported in [1], where the bit rate of an LPC vocoder was
substantially reduced by applying VQ to the LPC parameters. Subsequently VQ found its way into waveform coding as well and
in particular a generalization of DPCM using vector prediction together with VQ was reported in [2]. Today VQ is a
well-established and widely used technique. It has been applied to the efficient coding of the LPC parameter set, the pitch
predictor filter parameters, as well as to Vector PCM (VPCM), the coding of a waveform by partitioning it into
consecutive blocks (vectors) of samples.

OPEN  LOOP  VECTOR  PREDICTIVE  CODING 

To illustrate the use of VQ, let us return to the APC scheme described above and consider that the largest contribution to
the bit-rate of an APC coder is the coding of the residual waveform. However, in the structure of Fig. 7 the residual is
generated only one sample at a time, and the next residual sample depends on feeding back the previous sample for
obtaining the next short-term prediction. Thus the structure is not immediately amenable to VQ, which requires storing up a
block of residual samples before performing the pattern matching operation. There are two ways to circumvent this
obstacle. One is based on a vector generalization of ADPCM introduced in [2] and extended to a vector version of APC in [3].
The other is simply to modify the encoder structure by removing the feedback around the quantizer, and generate the
prediction residual by an open-loop method as is shown in Fig. 9. Note that the decoder has the same synthesis filter structure
as that of the more conventional APC scheme. Here VPCM is applied to the residual signal, and since many of its samples
may be encoded by a few bits, fractional bit rates (i.e. less than 1 bit per sample) can be attained.
Fig. 9 Residual Encoding with Vector Quantization


Although this scheme has been applied by several researchers to speech coding, it suffers from one major disadvantage.
Unlike the previous APC scheme, the overall error between original and reproduced speech in this coder is not equal to the
error produced by the quantizer. Ordinarily, a VQ codebook is optimally designed to minimise the average distortion
between input and reproduced vectors and encoding is performed by simply selecting the code vector best matching a given
input vector. In this coding scheme this implies that the reproduced residual is made to approximate the unquantized
residual as closely as possible. However, this is not an optimal strategy, since our objective is to make the reproduced speech
as close as possible to the original speech. With the predictor filters time-varying, these turn out not to be identical
criteria, as the relationship between the error in quantizing the residual and the error in reproducing the original speech is a
very complex one and varies from frame to frame.

These observations suggest that regardless of whether we use scalar or vector quantization or any other mechanism for
digitally specifying an excitation signal for the decoder, the main task for the encoder is to figure out what excitation will
do the best job of reproducing the original speech. The encoder structure of Fig. 9 incorporates a somewhat ad hoc
mechanism for selecting an excitation vector from the codebook, which focuses narrowly on the residual signal, rather
than on the speech itself. This is an intrinsic limitation of the open-loop structure.

Let us therefore discard this encoder, and consider what is the best possible structure that can be used to supply data to the
decoder given in Fig. 9. This perspective has led to a new generation of coding techniques, often called hybrid coding
methods, which are based on the use of analysis-by-synthesis to determine the best excitation signal that will lead to an
effective reproduction of the original speech.

ANALYSIS-BY-SYNTHESIS EXCITATION CODING

We now examine the most important family of speech coding algorithms known today, described as Analysis-by-Synthesis
Excitation Coding or more concisely Excitation Coding. Consider the general decoder structure of Fig. 10, consisting of a
synthesis filter (usually a cascade of both long- and short-term filters) to which is applied an excitation signal which is
somehow specified by bits sent by the encoder.
Fig. 10 Excitation Coding


The synthesis filters are periodically updated, usually by separate side information from the transmitter. The LPC analysis
task is classical and straightforward, and we pay no further attention to it here. The open-loop method for computing the
pitch predictor, which yields the synthesis filter parameters, was described earlier.

The encoder contains a copy of the decoder so that for any excitation waveform it can generate the same signal as
the decoder would. Given a bit allocation and a mechanism for generating such waveforms, the encoder actually generates
by trial and error all possible excitation signals for each time segment. The key idea here is that we try a large family of
possible excitation segments and then apply each member in turn to the synthesis filter (the speech production model). For
each synthesized segment we can compute a quantitative distortion measure, which indicates how badly the segment
differs from the intended original. This process is repeated until the best excitation segment is found. Then, and only
then, is the binary word specifying the best excitation segment transmitted.
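The difference between this trial-and-error selection and the open-loop residual matching of Fig. 9 can be seen on a toy case. The sketch below assumes a one-pole synthesis filter with coefficient `a`, a two-entry codebook and a squared-error distortion; none of these values come from the text, and they are chosen only so that the two criteria disagree.

```python
def synthesize(excitation, a=0.9, state=0.0):
    """One-pole synthesis filter y[n] = e[n] + a*y[n-1] (toy speech production model)."""
    out = []
    prev = state
    for e in excitation:
        prev = e + a * prev
        out.append(prev)
    return out


def sq_err(u, v):
    """Sum of squared differences between two equal-length segments."""
    return sum((p - q) ** 2 for p, q in zip(u, v))


def open_loop_choice(residual, codebook):
    """Pick the code vector closest to the unquantized residual."""
    return min(range(len(codebook)), key=lambda i: sq_err(residual, codebook[i]))


def analysis_by_synthesis_choice(speech, codebook, a=0.9, state=0.0):
    """Synthesize every candidate and pick the one closest to the original speech."""
    return min(
        range(len(codebook)),
        key=lambda i: sq_err(speech, synthesize(codebook[i], a, state)),
    )


residual = [1.0, 0.0]
speech = synthesize(residual)            # the "original" speech segment
codebook = [[1.0, -0.5], [0.6, 0.35]]    # hypothetical excitation candidates
picked_open = open_loop_choice(residual, codebook)
picked_closed = analysis_by_synthesis_choice(speech, codebook)
```

Here the first code vector is the better match to the residual, yet the second one yields the smaller error in the synthesized speech, so the two encoders transmit different indices for the same input.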

The task of finding an appropriate excitation signal by copying the decoder at the encoder can be viewed as an analysis
process, since in some sense we are extracting an appropriate excitation signal from the original speech. The method is called
analysis by synthesis because this is done by synthesizing the speech segment that each candidate excitation would
produce to examine how well it reproduces the original speech.

There are three principal mechanisms for generating excitation signals for this class of coding systems, known as tree or
trellis coding, multipulse coding, and VQ. While all three are of interest, the third is most widely used, and we focus on
this approach in the sequel. The generic coding algorithm for the use of a VQ codebook is called Vector Excitation
Coding (VXC), also known as Code-Excited Linear Prediction (CELP). This has led to many powerful speech coders for bit
rates ranging from 4.8 to 16 kb/s.

VECTOR EXCITATION CODING

A generic VXC decoder structure is shown in Fig. 11. It is natural to describe the decoder first since it determines how the
speech can be synthesized from transmitted data. The encoder is in a sense a servant of the decoder, since its job is to
examine the original speech and determine the best data to supply the decoder. The decoder receives and demultiplexes
the data needed to specify the synthesis filter parameters, the excitation code vector, and in addition, a gain scaling factor.
A standard technique in VQ is to take advantage of the fact that owing to the wide dynamic range of speech, similarly
shaped waveform portions may occur with different amplitudes, so that one may attribute to each segment a "gain" and a
"shape" property. These attributes can then be handled separately via different codebooks, avoiding the inherent
duplication of waveform segments of similar shape, differing only in energy. By this method both codebook size and search
complexity can be reduced.

Fig. 11 VXC Decoder

It has been found empirically that the parameters of the synthesis filter need to be updated less frequently than new
excitation vectors need to be supplied. For a 4.8 kb/s coder a typical frame size, i.e. the time span between successive updates of
the synthesis filter, is 20-30 ms, while the excitation vector dimension, called a subframe, may be a quarter of this. For
higher bit-rate coders there may be even more subframes in a frame.

For each subframe, the decoder receives a sequence of $c + t$ excitation code bits which identify a pair of indexes which
specify one of $2^c$ excitation code vectors and one of $2^t$ gain levels, both by means of a table-lookup procedure. This
leads to a gain-scaled excitation vector with dimension k. This vector is serialized as k successive samples and is applied
to the synthesis filter. The filter is clocked for k samples, feeding out the next k samples of the synthesized speech; then it
is "frozen" until the next scaled excitation vector is available as the next input segment to the synthesis filter.
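The per-subframe decoder operation can be sketched as follows. This is a simplified illustration, not the structure of Fig. 11: it assumes only a short-term all-pole synthesis filter (the long-term filter and the postfilter are omitted), and the class name, codebooks and coefficients are hypothetical.

```python
class VXCDecoderSketch:
    """Gain-shape table lookup followed by an all-pole synthesis filter.

    The filter implements y[n] = e[n] + sum_i a_i * y[n-i] and keeps its
    memory between subframes, i.e. it is "frozen" until the next call.
    """

    def __init__(self, shape_codebook, gain_codebook, lpc):
        self.shapes = shape_codebook          # excitation code vectors, each of dimension k
        self.gains = gain_codebook            # gain levels
        self.lpc = list(lpc)                  # predictor coefficients a_1 .. a_p
        self.memory = [0.0] * len(self.lpc)   # past outputs, most recent first

    def decode_subframe(self, shape_index, gain_index):
        gain = self.gains[gain_index]
        out = []
        for sample in self.shapes[shape_index]:
            y = gain * sample + sum(a * m for a, m in zip(self.lpc, self.memory))
            self.memory = [y] + self.memory[:-1]   # clock the filter by one sample
            out.append(y)
        return out


decoder = VXCDecoderSketch(
    shape_codebook=[[1.0, 0.0], [0.0, 1.0]],
    gain_codebook=[0.5, 2.0],
    lpc=[0.5],
)
first = decoder.decode_subframe(shape_index=0, gain_index=1)
second = decoder.decode_subframe(shape_index=1, gain_index=0)
```

The second subframe starts from the filter memory left by the first, which is why its output depends on what was decoded before it.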

In many applications an adaptive postfilter is added to the decoder as a final postprocessing stage, to enhance the quality of
the recovered speech. This filter is adapted to correspond to the short-term spectrum of the speech. We shall later describe
the operation of the adaptive postfilter; for now, however, we ignore it since it is not a fundamental or essential component.

The VXC encoder structure is shown in Fig. 12. We describe its basic operation, while ignoring the many
short cuts and tricks which greatly reduce the complexity involved in the search process. The encoder receives input
speech samples which are grouped into blocks of k contiguous samples, each treated as a vector. On the arrival of each
such vector, the task of the encoder is to determine the next $c + t$ bits of data to be transmitted to the decoder so that the
decoder will then be able to synthesize a reconstructed output speech vector that closely approximates the original input
speech vector.

This implies that the encoder embody a replica of the decoder, as shown in Fig. 12, which can locally generate each of the
$N = 2^{c+t}$ possible speech vector candidates that the decoder would produce for the same transmitted data values.
However, the replica decoder does not include the postfilter used in the actual decoder.

In order to search for a reproduction that is closest, in a perceptually meaningful sense, to the original speech, a perceptual
weighting filter is used to modify both the original input speech and the reconstructed output speech vector before the
distortion between the two is measured. Note that the weighting filter is combined with the synthesis filter to give a weighted
synthesis filter with a modified transfer function that is distinct from the synthesis filter used in the decoder. Speech
samples emerging from the weighting filter are also configured into corresponding vectors of k contiguous samples, called
"weighted speech vectors".

Since the replica decoder is operating repeatedly in the search process, we must ensure that each candidate output speech
vector, corresponding to a candidate data index pair being tested, is produced under the same conditions as will be present
when the actual decoder generates the next output vector. After each test of a candidate index pair, the memory state of
the replica decoder has changed and is no longer at the correct initial condition for the next test. Therefore, before
generating each of these candidates, the memory in the replica decoder (including the perceptual weighting filter) must also be
reset to the correct initial conditions.

The error minimization search module sequentially generates a pair of test indexes, corresponding to a particular pair of
code vector and gain level. These are fed to the replica of the decoder which generates a synthesized speech vector that
would be produced by the actual decoder if this index pair were actually transmitted. The replica decoder is initialized by
setting the weighted synthesis filter memory to those initial conditions that were determined after the prior search process
was completed. Then, the test index is applied to the excitation codebook and the gain index to the gain codebook,
yielding a gain and an excitation vector. The gain-scaled excitation vector is then applied to the weighted synthesis filter to
produce the output vector $r_n$. The vector $r_n$ is then subtracted from the input speech vector $v_n$ and the distortion between
these two vectors, i.e., the sum of the squares of the components of the difference vector, is computed by the distortion
computation module. This error value is applied to the search module which stores the distortion value, compares it with
the lowest distortion value obtained so far in the current search process, and, if appropriate, updates the lowest distortion
value and the corresponding vector index.
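The search loop described above, including the reset of the replica decoder memory before every candidate, might look as follows. This is a minimal sketch under stated assumptions: the weighted synthesis filter is represented by a plain all-pole filter with coefficients `lpc`, the target is assumed to be already perceptually weighted, and all names are hypothetical.

```python
def filter_block(excitation, lpc, memory):
    """Run y[n] = e[n] + sum_i a_i*y[n-i] over one block.

    `memory` holds past outputs (most recent first) and is not modified;
    the updated memory is returned alongside the output.
    """
    mem = list(memory)
    out = []
    for e in excitation:
        y = e + sum(a * m for a, m in zip(lpc, mem))
        mem = [y] + mem[:-1]
        out.append(y)
    return out, mem


def search_subframe(target, shapes, gains, lpc, memory):
    """Exhaustive gain-shape search for one subframe.

    Every candidate is synthesized from the same initial `memory`.
    Returns (distortion, shape index, gain index, memory after the winner).
    """
    best = None
    for i, shape in enumerate(shapes):
        for j, gain in enumerate(gains):
            scaled = [gain * s for s in shape]
            synth, new_memory = filter_block(scaled, lpc, memory)   # fresh start each time
            distortion = sum((t - y) ** 2 for t, y in zip(target, synth))
            if best is None or distortion < best[0]:
                best = (distortion, i, j, new_memory)
    return best


result = search_subframe(
    target=[2.0, 1.0],
    shapes=[[1.0, 0.0], [0.0, 1.0]],
    gains=[0.5, 2.0],
    lpc=[0.5],
    memory=[0.0],
)
```

The memory returned with the winning pair is what the replica decoder would carry into the next subframe, matching the state of the actual decoder after it receives the same indexes.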

VECTOR  SUM  EXCITATION  CODEBOOKS 

A question of practical importance is how the quality of a given VXC coder can be improved if more bits are made
available, and to which components of the coder these bits should be assigned for the maximum benefit. It is generally
recognized that the best performance gain comes from increasing the codebook size. However, adding just one bit per code
vector doubles the codebook size and the corresponding search complexity. Thus, computational constraints of the available
signal processor quickly force one to limit the codebook size and lead to alternative designs where the vector dimension is
reduced and more bits are given to synthesis filter parameters. The use of specially constrained codebook structures offers
the possibility of larger codebooks and significant performance improvements while maintaining tolerable complexity.

Gerson and Jasiuk recently introduced a technique for reducing the complexity of the excitation codebook search
procedure [4]. Rather than have each of M code vectors be independently generated either randomly or by a design procedure,
they design b basis vectors and then generate the $M = 2^b$ code vectors by taking binary linear combinations of the basis
vectors. The resulting coding algorithm, a derivative of VXC, is called Vector Sum Excited Linear Prediction (VSELP)
and an 8 kb/s version of this algorithm has been adopted as a standard for the U.S. cellular mobile telephone industry. We
next explain the basic idea of this technique for fast codebook search.

Let $v_i$ denote the i-th basis vector of a given set of b basis vectors. The code vectors are then formed as

$$u_\theta = \sum_{i=1}^{b} \theta_i v_i$$

by taking all possible linear combinations where $\theta_i = \pm 1$ for each i. Thus each binary-valued vector $\theta$ determines a
particular code vector $u_\theta$. Naturally, the b bit binary word transmitted over the channel can simply correspond to a mapping of
$\theta$ values with +1 being a binary 1 and -1 being a binary 0. Since the code vectors are so simply generated, only b basis vectors
need be stored rather than storing an entire codebook of M code vectors.

This special codebook structure can be searched very efficiently. Instead of finding the vector output of the weighted
synthesis filter for each of the M codevectors, only the filtered output of the b basis vectors need be determined because from
these any synthesized output can be readily obtained by addition. Furthermore the search for the optimal codevector
becomes computationally simplified by noting that the mean-squared error between the weighted input vector and a
filtered codevector depends in a simple manner on the values of $\theta_i$. By ordering the b bit binary word in a Gray code, only
one bit changes from one word to the next. This means that only a simple change is needed to compute the mean-squared
error for the next candidate code vector from the previous candidate code vector.
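The Gray-code search can be made concrete with a short sketch. It is not the procedure of [4] in detail: it assumes unit gain and a plain squared-error criterion, and it takes the filtered basis vectors as given. Writing the filtered basis vectors as f_i, the error is E = t.t - 2C + G with C = sum_i theta_i (t.f_i) and G = sum_ij theta_i theta_j (f_i.f_j), so flipping one theta_m changes C by -2 theta_m (t.f_m) and G by -4 theta_m sum_{j != m} theta_j (f_m.f_j). All function names are hypothetical.

```python
def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def code_vector(word, basis):
    """Expand a b-bit word into sum_i theta_i * v_i (bit i set means theta_i = +1)."""
    out = [0.0] * len(basis[0])
    for i, v in enumerate(basis):
        theta = 1 if (word >> i) & 1 else -1
        out = [o + theta * x for o, x in zip(out, v)]
    return out


def vector_sum_search(target, filtered_basis):
    """Visit all 2**b words in Gray-code order, updating the error incrementally."""
    b = len(filtered_basis)
    R = [dot(target, f) for f in filtered_basis]                       # t . f_i
    D = [[dot(fi, fj) for fj in filtered_basis] for fi in filtered_basis]
    theta = [-1] * b                                                   # word 0: all bits 0
    C = sum(th * r for th, r in zip(theta, R))
    G = sum(theta[i] * theta[j] * D[i][j] for i in range(b) for j in range(b))
    tt = dot(target, target)
    best_word, best_err = 0, tt - 2 * C + G
    prev = 0
    for n in range(1, 2 ** b):
        word = n ^ (n >> 1)                        # n-th Gray code word
        m = (word ^ prev).bit_length() - 1         # the single bit that changed
        cross = sum(theta[j] * D[m][j] for j in range(b) if j != m)
        C -= 2 * theta[m] * R[m]                   # update uses theta[m] before the flip
        G -= 4 * theta[m] * cross
        theta[m] = -theta[m]
        err = tt - 2 * C + G
        if err < best_err:
            best_word, best_err = word, err
        prev = word
    return best_word, best_err
```

Each step of the loop costs on the order of b operations, instead of the filtering and full error computation that an unstructured codebook would require for every one of the M code vectors.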

The vector sum approach can be augmented by using multiple-stage VXC [5], and joint optimization of the gains for each
stage. The joint optimization becomes easy to implement with the vector sum codebooks [4].

CLOSED-LOOP  PITCH  SYNTHESIS  FILTERING 

An alternative and improved method of designing the long-term predictor (LTP) filter was first proposed for the
multipulse excitation coder [6] and later applied to vector excitation coders [7] [8] [9]. Although it is of higher complexity
and requires a higher bit-rate, it does offer superior performance. Furthermore, when the closed-loop LTP is used, the size
of the excitation codebook is reduced and hence the computational load is reduced.

The pitch lag and predictor coefficients of a closed-loop LTP are chosen in such a way that the mean square of the
perceptually weighted reconstruction error vector is minimized.

For a one-tap LTP, the predictor parameters can be determined in two steps: (a) find the pitch lag m (from a predefined
range) that maximizes a quantity that is independent of the prediction coefficient, and (b) compute the prediction
coefficient from a simple formula.
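The two steps can be sketched as below. The text does not give the quantity or the formula, so this sketch assumes the standard least-squares result: for a candidate contribution y_m at lag m and a target t, the best coefficient is (t.y_m)/(y_m.y_m) and the remaining error is smallest when (t.y_m)^2/(y_m.y_m) is largest. The `respond` argument stands in for the weighted synthesis filter and defaults to the identity purely for illustration; all names are hypothetical.

```python
def closed_loop_ltp(target, past_excitation, lags, respond=lambda segment: segment):
    """One-tap closed-loop LTP search.

    Step (a): choose the lag m maximizing corr**2 / energy.
    Step (b): coefficient = corr / energy for that lag.
    Lags shorter than the vector dimension are skipped, since the needed
    past segment would not yet be available.
    """
    k = len(target)
    best = None
    for m in lags:
        start = len(past_excitation) - m
        if m < k or start < 0:
            continue
        candidate = respond(past_excitation[start:start + k])
        energy = sum(v * v for v in candidate)
        if energy == 0.0:
            continue
        corr = sum(t * v for t, v in zip(target, candidate))
        score = corr * corr / energy
        if best is None or score > best[0]:
            best = (score, m, corr / energy)
    if best is None:
        raise ValueError("no usable lag in the given range")
    _, lag, coefficient = best
    return lag, coefficient


period = [1.0, -2.0, 0.5, 3.0, -1.0]
past = period * 3                         # past excitation with pitch period 5
target = [0.5 * v for v in period[:4]]    # next segment, scaled by 0.5
lag, coefficient = closed_loop_ltp(target, past, range(4, 11))
```

For this periodic toy signal the search returns the true period as the lag and the scaling factor as the coefficient.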

In the closed-loop LTP method, the pitch lag ordinarily has to be greater than or equal to the speech vector dimension in order
to obtain the previous LTP output vector. Hence, the vector dimension, which is also the adaptation interval of the LTP,
needs to be reasonably small to handle short pitch periods. Decreasing the adaptation interval increases the bit rate needed
to code the LTP parameters.


ADAPTIVE POSTFILTERING

As already discussed, the perceptual weighting filter is a valuable component of a VXC encoder since it exploits the
masking effect in human hearing by removing quantization noise from exposed frequency regions where the signal energy is
low, and "hiding" it under spectral peaks. At bit rates as low as 4.8 kb/s or less, however, the average noise level is quite
high and thus it is not possible to simultaneously keep the noise below the masking threshold at spectral valleys as well as
at formant frequencies. Since the formant peaks are more critical for perceptual quality, at low bit rates the weighting
filter tends to protect these regions while tolerating noise above the threshold in the valleys. The technique of adaptive
postfiltering attempts to rectify this by selectively attenuating the reproduced speech signal in the spectral valleys. This
somewhat distorts the speech spectrum in the valleys but it also reduces the audible noise. Since a faithful reproduction of
the spectral shape is perceptually much less important in the valleys than near formants, the overall effect is beneficial and
leads to a notable improvement in subjective speech quality.


A more primitive form of adaptive postfiltering to enhance performance was applied to ADPCM by Ramamoorthy and
Jayant [10] and to APC by Yatsuzuka [11]. Recently, an improved version of adaptive postfiltering was found [12] which
is effective for VXC (or CELP).

For a filter to attenuate the spectral valleys, it must adapt to the time-varying spectrum of the speech. The synthesis filter
parameters provide the needed information to identify the location of these valleys and are thus used to periodically update
the postfilter parameters. Since the LPC spectrum of a voiced sound typically tilts downward at about 6 dB per octave,
the corresponding all-pole postfilter will also have such a tilt, causing undesirable muffling of the sound. This can be
overcome by augmenting the postfilter with zeros at the same or similar angles as the poles but with smaller radii. The idea is
to generate a numerator transfer function that compensates for the smoothed spectral shape of the denominator. The
overall transfer function used for the postfilter in [12] is a pole-zero transfer function, given by:
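
A form consistent with the all-pole filter of Fig. 13 and with the description of zeros at the same angles but smaller radii can be written as follows. The symbol $\beta$ for the numerator scaling factor, and the ordering of the two factors, are assumptions introduced here; the text itself names only $\alpha$.

$$H(z) = \frac{1 - P(z/\beta)}{1 - P(z/\alpha)}, \qquad 0 < \beta < \alpha < 1, \qquad P(z) = \sum_{i=1}^{p} a_i z^{-i}$$

Replacing $z$ by $z/\alpha$ multiplies each predictor coefficient $a_i$ by $\alpha^i$, which moves every pole of the synthesis filter radially toward the origin without changing its angle; the numerator does the same with the smaller factor $\beta$, which places the zeros at smaller radii than the poles.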


Figure 13 shows the spectral magnitude of an all-pole filter $[1 - P(z/\alpha)]^{-1}$ for different values of $\alpha$ and for a particular LPC
speech frame. Note the spectral tilt effect that arises here. For comparison, the frequency response of the pole-zero
postfilter is shown in Fig. 14 where the spectral tilt (and associated muffling effect) are substantially reduced. Since the
transfer function of the postfilter changes with each speech frame, a time-varying gain is produced. To avoid this effect,
an automatic gain control is used.
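A pole-zero postfilter with gain control can be sketched as follows. This is an illustration under several assumptions: the numerator is taken as $1 - P(z/\beta)$ with a hypothetical factor `beta` smaller than `alpha`, the values 0.8 and 0.5 are placeholders, the filter starts from zero memory in every frame, and the gain control simply matches output energy to input energy over the frame.

```python
import math


def scaled(lpc, factor):
    """Coefficients of P(z/factor): a_i becomes a_i * factor**i for i = 1..p."""
    return [a * factor ** (i + 1) for i, a in enumerate(lpc)]


def postfilter_frame(frame, lpc, alpha=0.8, beta=0.5):
    """Apply H(z) = (1 - P(z/beta)) / (1 - P(z/alpha)) to one frame, then
    rescale so the output energy equals the input energy."""
    num = scaled(lpc, beta)
    den = scaled(lpc, alpha)
    p = len(lpc)
    x_hist = [0.0] * p      # past inputs, most recent first
    y_hist = [0.0] * p      # past outputs, most recent first
    out = []
    for x in frame:
        y = (x
             - sum(b * xh for b, xh in zip(num, x_hist))
             + sum(a * yh for a, yh in zip(den, y_hist)))
        x_hist = [x] + x_hist[:-1]
        y_hist = [y] + y_hist[:-1]
        out.append(y)
    energy_in = sum(v * v for v in frame)
    energy_out = sum(v * v for v in out)
    gain = math.sqrt(energy_in / energy_out) if energy_out > 0.0 else 1.0
    return [gain * v for v in out]
```

A practical implementation would carry the filter memory across frames and smooth the gain over time; both are left out to keep the sketch short.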

We can think of the reproduced speech coming into the postfilter as the sum of clean speech and quantizing noise.
Although the postfilter is of course attenuating spectral valleys of both the speech and the noise, the distorting effect of the
filter on the speech is negligible due to the low sensitivity of the ear to changes in the level of the spectral valleys. This
has been verified by applying the original (uncoded) speech to the adaptive postfilter; the original and filtered speech
sound essentially the same.

Fig. 13 Spectral Magnitude of All-Pole Postfilter $[1 - P(z/\alpha)]^{-1}$
for different values of $\alpha$ and for a particular LPC speech frame.

Though postfiltering clearly improves the performance of a single codec, when multiple stages of coding and decoding
follow each other, the postfilter in each stage introduces a slight degradation that accumulates with the number of stages.
Postfiltering may thus not be desired for applications with tandeming.

Pole-zero adaptive postfiltering following the approach described above has been included in the U.S. digital cellular
telephone standard for 8 kb/s speech coding and as an optional feature in the U.S. government standard for 4.8 kb/s speech
coding [13]. Both standards are derivatives of VXC.
LOW DELAY VXC

Vector Excitation Coding (VXC) combines techniques such as vector quantization, analysis-by-synthesis codebook
searching, perceptual weighting, and linear predictive coding to successfully achieve good speech quality at low bit rates.
However, one important aspect of coding has been ignored in the development of VXC or other conventional low bit rate
excitation coding schemes; that is the coding delay. In fact, most existing speech coders with rates at or below 16 kbps
require high delays in their operation, and cause various problems when they are applied to practical communication
systems. In the case of VXC, a large net coding delay, excluding computational delays, results from the use of buffering needed
to perform the LPC and open-loop pitch analysis. Recently, new methods have been proposed to adapt synthesis filters
without the high coding delay mentioned above while maintaining the quality of encoded speech.

With the conventional VXC scheme described above, the synthesis filter is adaptively updated every frame using what is
sometimes called forward adaptation, the process of recomputing and updating
the desired filter parameters from the input speech. The use of forward adaptation has two disadvantages: it requires
transmission of side information to the receiver to specify the filter parameters and it leads to a large encoding delay of at
least one analysis frame due to the buffering of input speech samples. The input buffering and other processing typically
result in a one-way codec delay of 50 to 60 ms. In certain applications in the telecommunications network environment,
coding delays as low as 2 ms per codec are required. Recently, the CCITT adopted a performance requirement of less than
5 ms delay with a desired objective of less than 2 ms for candidate 16 kbit/s speech coding algorithms to be considered
for a new standard intended to achieve the same quality as the 32 kb/s ADPCM standard, G.721. Such a low delay is not
feasible with the established coders that are based on forward adaptive prediction coding systems. Although the 32 kbit/s
ADPCM algorithm, CCITT Recommendation G.721, satisfies the low delay requirement, it cannot give acceptable quality
when the bit rate is reduced to 16 kbit/s.

An alternative solution is based on a recently proposed backward adaptation configuration. In a backward adaptive
analysis-by-synthesis configuration, the parameters of the synthesis filter are not derived from the original speech signal,
but computed by backward adaptation extracting information only from the sequence of transmitted codebook indices.
Since both the encoder and decoder have access to the past reconstructed signal, side information is no longer needed for
the synthesis filters, and the low delay requirement can be met with a suitable choice of vector dimension.
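
The principle that no side information is needed can be illustrated with a toy coder. The sketch below is a scalar, sample-by-sample stand-in and not a vector excitation coder: it assumes a first-order predictor recomputed from a sliding window of past reconstructed samples, and a small table of quantizer levels. All names and values are hypothetical.

```python
def one_tap_from_past(recon, window=16):
    """First-order predictor coefficient r(1)/r(0), computed only from
    past reconstructed samples (available to encoder and decoder alike)."""
    past = recon[-window:]
    r0 = sum(v * v for v in past)
    if r0 == 0.0:
        return 0.0
    r1 = sum(p * q for p, q in zip(past[1:], past[:-1]))
    return r1 / r0


def encode(x, levels, window=16):
    """Return (transmitted indices, reconstruction as the decoder will see it)."""
    recon, indices = [], []
    for xk in x:
        a = one_tap_from_past(recon, window)
        pred = a * recon[-1] if recon else 0.0
        idx = min(range(len(levels)), key=lambda i: (xk - pred - levels[i]) ** 2)
        indices.append(idx)
        recon.append(pred + levels[idx])
    return indices, recon


def decode(indices, levels, window=16):
    """Rebuild the signal from the indices alone; the predictor is re-derived."""
    recon = []
    for idx in indices:
        a = one_tap_from_past(recon, window)
        pred = a * recon[-1] if recon else 0.0
        recon.append(pred + levels[idx])
    return recon
```

Since the decoder repeats exactly the adaptation the encoder performed, the only data transmitted are the indices, and the only buffering is the one sample (in a vector coder, the one excitation vector) being coded.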

VXC incorporated with backward adaptation to satisfy the low-delay requirement is called Low-Delay VXC or
Low-Delay CELP. Two approaches to backward adaptation are studied, and they are classified as block and recursive. In the
block algorithms, the reconstructed signal and the corresponding gain-scaled excitation vectors are divided into blocks
(frames), and the optimum parameters of the adaptive filter are determined independently within each block. In the
recursive algorithms, the parameters are updated incrementally after each successive excitation and reconstructed vector are
generated.

To achieve the low-delay requirement, two versions of LD-VXC were proposed to the CCITT. One uses a codebook of
dimension 3 and a very high order block-adaptive short-term predictor computed by LPC analysis on the previously
reproduced speech. The other has a codebook of dimension 4 and uses a recursive backward adaptation method for a pole-zero
predictor and for a pitch predictor. With the standard sampling rate of 8 kHz, we are allowed to use a codebook of size
256 at 16 kbit/s. Simulation results show that LD-VXC achieves an SNR of about 20 dB with either block or recursive
adaptation. Transmission errors were also taken into account in the design of LD-VXC. With the help of leaky factors and
pseudo-Gray coding, the performance of the coder only degrades slightly at 0.1% error rate, and intelligible speech is
produced even at error rates as high as 1%. More details are reported in [14] and [15].
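
The relation between vector dimension, codebook size and bit rate quoted above can be checked with a short calculation. The sketch below assumes that every transmitted bit is spent on the codebook index (which is the point of backward adaptation) and that one index is sent per vector; the function names are hypothetical.

```python
import math


def vector_coder_bit_rate(sample_rate_hz, vector_dim, codebook_size):
    """Bit rate in bit/s when one codebook index is sent per vector."""
    bits_per_vector = math.log2(codebook_size)
    vectors_per_second = sample_rate_hz / vector_dim
    return bits_per_vector * vectors_per_second


def max_codebook_size(sample_rate_hz, vector_dim, bit_rate):
    """Largest power-of-two codebook that fits within the given bit rate."""
    bits_per_vector = bit_rate * vector_dim / sample_rate_hz
    return 2 ** int(bits_per_vector)


def vector_duration_ms(sample_rate_hz, vector_dim):
    """Time spanned by one vector, which bounds the buffering delay."""
    return 1000.0 * vector_dim / sample_rate_hz


print(vector_coder_bit_rate(8000, 4, 256))   # 16000.0
print(max_codebook_size(8000, 4, 16000))     # 256
print(vector_duration_ms(8000, 4))           # 0.5
```

With dimension 4 at 8 kHz there are 2000 vectors per second, so 8 bits per vector (a codebook of 256 entries) gives exactly 16 kbit/s, and each vector spans only 0.5 ms of speech.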


VXC  WITH  PHONETIC  SEGMENTATION 

Although VXC achieves fairly high-quality speech at 4.8 kbps, the performance achieved with current VXC based
algorithms degrades rapidly as the bit-rate is reduced below 4.8 kbps, leaving a substantial gap between the natural voice
quality of VXC at 4.8 kbps and the synthetic quality attainable at 2.4 kbps (or higher) with an LPC vocoder. An important
future direction for speech coding is to find coding algorithms that will achieve at 4 kb/s and below the natural quality
attainable today with the best versions of VXC. One of the motivations for this interest is the next generation of digital
cellular telephones where it is expected that a bit rate in the neighborhood of 4 kb/s will be required in order to meet the
increasing channel capacity objectives.

One research direction that we have been studying, Phonetically-Segmented Vector Excitation Coding (PS-VXC) [16],
appears to show promise and might lead to a speech coder operating at bit rates significantly below 4.8 kb/s yet with a
quality comparable to current 4.8 kb/s coders.

In this method, speech is segmented into a sequence of contiguous variable-length segments constrained to be an integer
multiple of a fixed unit length. The segments are classified into one of six phonetic categories. This provides the
front-end to a bank of VXC coders that are individually tailored to the different categories.

The motivation for this work derives from the fact that phonetically distinct speech segments require different coding
treatments for preserving what we call phonetic integrity. With phonetic segmentation, we can assign the wide variety of
possible speech segments into a small number of phonetically distinct groups. In each group, different analysis methods
and coding strategies can be used to emphasize the critical parameters corresponding to important perceptual cues. It also
becomes easier to identify each individual coding problem in isolated phonetic groups and optimize a multi-mode coding
algorithm to suit various phonetic categories.

Table 1 summarizes the segment classification and coding structures used for these classes by specifying salient features
and  coder  parameters  for  each  of  the  six  categories.  Table  2  lists  the  bit-allocation  for  each  category  in  PS-VXC.  The 
details  of  the  coding  algorithm  and  recent  improvements  are  reported  in  [16]  and  [17]. 

The three main segment types, if coded individually, would yield rates as follows: unvoiced, 3 kb/s; unvoiced/onset
pairs, 3.6 kb/s; voiced, 3.6 kb/s. For typical speech files, the average rate is 3.4 kb/s, which could be achieved as a
fixed rate with buffering of the encoder output. Alternatively, a fixed rate of 3.6 kb/s is readily attainable with some
padding of the bit stream.

Informal listening tests indicate that the quality at a fixed 3.6 kb/s rate is roughly comparable to that of conventional VXC
at 4.8 kb/s. Nevertheless, there is room for considerable improvement both in the coding algorithm for particular segment
categories and in the definition and number of the phonetic classes used in the segmentation process. An end-to-end
coding delay of approximately 100 ms (including overhead) is anticipated.

CONCLUDING  REMARKS 

In this overview, we have only touched the surface of the rich and active field of speech coding. We have described some
of the main concepts that underlie speech coding algorithms of current interest, in particular linear prediction for
both short and long term, analysis-by-synthesis, vector quantization, perceptual weighting for noise shaping, adaptive
postfiltering, closed-loop pitch analysis, and vector-sum codebook structures. No doubt in the next few years, there will
be new advances that we cannot anticipate today.

The  motivation  for  the  continued  activity  in  speech  coding  research  is  in  large  part  due  to  the  combination  of  two  factors: 
the  rapidly  advancing  technology  of  signal  processor  integrated  circuits  and  the  ever  increasing  demand  for  wireless 
mobile and portable voice communications. The technology permits increasingly complex and sophisticated signal
processing algorithms to become implementable and cost effective. Mobile communications and the emerging wide scale
cordless portable telephones will increasingly stress the limited radio spectrum that is already pushing researchers to
provide lower bit-rate and higher quality speech coding with lower power consumption, increasingly miniaturized technology,
and  lower  cost.  The  insatiable  need  for  humans  to  communicate  with  one  another  will  continue  to  drive  speech  coding 
research  for  years  to  come. 
ABSTRACT

The field of digital speech processing includes the areas of speech
coding,  speech  synthesis,  and  speech  recognition.  With  the  advent  of 
faster  computation  and  high  speed  VLSI  circuits,  speech  processing 
algorithms  are  becoming  more  sophisticated,  more  robust,  and  more 
reliable.  As  a  result,  significant  advances  have  been  made  in  coding, 
synthesis,  and  recognition,  but,  in  each  area,  there  still  remain  great 
challenges  in  harnessing  speech  technology  to  human  needs. 

In  the  area  of  speech  coding,  current  algorithms  perform  well  at 
bit  rates  down  to  16  kbits/sec.  Current  research  is  directed  at  further 
reducing  the  coding  rate  for  high-quality  speech  into  the  data  speed 
range,  even  as  low  as  2.4  kbits/sec.  In  text-to-speech  synthesis  we  are 
able  to  produce  speech  which  is  very  intelligible  but  is  not  yet 
completely  natural.  Current  research  aims  at  providing  higher  quality 
and  intelligibility  to  the  synthetic  speech  produced  by  these  systems. 
Finally,  in  the  area  of  speech  and  speaker  recognition,  present  systems 
provide  excellent  performance  on  limited  tasks;  i.e.,  limited 
vocabulary,  modest  syntax,  small  talker  populations,  constrained 
inputs, and favorable signal-to-noise ratios. Current research is
directed at solving the problem of continuous speech recognition for
large vocabularies, and at verifying talkers' identities from a limited
amount of spoken text.


I. INTRODUCTION

Although  the  field  of  speech  processing  is  quite  broad  and 
encompasses  a  number  of  diverse  application  areas,  we  will  be 
concerned  in  this  paper  only  with  the  areas  of  speech  coding, 
synthesis,  and  recognition.  For  each  of  these  important  application 
areas  we  will  review  the  present  status  of  the  technology  and  discuss 
the directions in which research is heading.

Speech  coding  is  concerned  with  communication  between  people 
and  therefore  deals  with  techniques  of  speech  transmission,  generally 
over  the  conventional  telephone  network.  Of  central  concern  here  are 
methods  for  reducing  the  required  bandwidth  (or  equivalently  the 
digital  bit  rate)  for  transmitting  speech.  Speech  synthesis,  or 
computer  voice  response  as  it  is  often  called,  is  concerned  with 
machines talking to people. Although systems as simple as
announcement machines fall into this area, we will primarily be
concerned  with  the  state  of  the  art  and  current  research  directions  in 
the  area  of  text-to-speech  synthesis  systems.  Speech  recognition  is 
concerned  with  people  talking  to  machines.  Speech  recognizers  range 
in  sophistication  from  the  simplest  isolated  word/phrase  recognition 
systems,  to  fully  conversational  recognizers  that  attempt  to  deal  with 
vocabularies  and  syntax  comparable  to  natural  language.  Also 
included  in  the  broad  area  of  speech  recognition  is  the  topic  of 
speaker  recognition  in  which  the  job  of  the  machine  is  either  to  verify 
the  claimed  identity  of  a  talker,  or  to  identify  the  individual  talker  as 
one  of  a  fixed,  known  population. 

Digital speech processing has advanced in the past few years for
several  reasons.  One  key  reason  is  the  explosive  growth  in 
computational  capabilities,  supported  by  economical  VLSI  hardware. 
General  purpose  signal  processing  computers  exist  today  which  can 
run  standard  programming  languages  and  can  execute  algorithms  at 
rates on the order of 50-200 megaflops [1]. Such machines are
classified as mini-supercomputers, and cost less than main-frame
machines of a few years past. Similarly, VLSI digital signal
processor (DSP) chips now exist which do calculations in floating
point arithmetic at an 8 megaflop rate [2]. Thus even a 100 megaflop
algorithm can potentially be realized with about a dozen DSP chips
on a single circuit board.
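
The "about a dozen" figure follows from a simple division. The sketch below assumes ideal partitioning of the algorithm across chips, with no communication or scheduling overhead; `chips_needed` is a hypothetical helper.

```python
import math


def chips_needed(algorithm_mflops, chip_mflops):
    """Minimum number of chips if the load divides perfectly among them."""
    return math.ceil(algorithm_mflops / chip_mflops)


print(chips_needed(100, 8))   # 13
```

In practice the count would be somewhat higher, since real algorithms rarely divide evenly across processors.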

Another  reason  for  the  progress  in  digital  speech  processing  is  the 
improvements  that  have  been  made  in  speech  processing  algorithms. 
Speech  coding  has  benefited  significantly  from  the  introduction  of 
MPLPC (multi-pulse linear predictive coding) [3], and CELP
(code-excited linear prediction) [4]; the field of text-to-speech synthesis has
seen major improvements due to the introduction of large pronouncing
dictionaries; and the field of speech recognition has seen the maturity
of algorithms for recognizing connected words (e.g. level building [5]),
and the widespread acceptance of statistical modelling techniques
(namely hidden Markov models or HMMs) [6,7].

Finally,  perhaps  the  greatest  recent  impetus  in  advancing  digital 
speech  processing  has  been  the  growing  need  for  products  that  serve 
real-world  applications.  The  past  decade  has  seen  major  growth  in 
the  utility  of  voice  products  for  at  least  four  market  sectors  — 
namely,  telecommunications,  business  applications,  consumer 
products,  and  government.  In  the  telecommunications  sector,  voice 
coders  are  used  for  reduced  bit-rate  transmission  and  privacy; 
repertory  name  dialers  are  used  for  hands-free  dialing;  announcement 
systems  are  used  to  speak  computer  stored  information  to  customers; 
and  a  wide  variety  of  operator  and  attendant  services  depend  upon 
recognition  and  synthesis  for  increased  utility.  In  business 
applications,  voice  mail  and  store-and-forward  services  are  already  in 
widespread  use,  and  voice  interactive  terminals  and  workstations  are 
beginning  to  appear  on  the  market.  In  the  consumer  products  and 
services  sector,  toys  using  either  synthesis  and/or  recognition  have 
been  available  for  several  years,  and  recently  residence  communication 
systems  and  alarm  announcement  systems  have  started  to  appear.  In 
the  area  of  government  communications,  anticipated  uses  include 
coding  for  secure  communications,  and  voice  control  of  military 
systems. 

The above examples, by no means exhaustive, illustrate the
burgeoning  applications  of  speech  processing  and  point  to  a  growing 
market  in  the  coming  years. 

It  is  the  purpose  of  this  paper  to  outline  the  main  issues  in  speech 
coding, synthesis, and recognition, to indicate where progress has been
made,  and  to  point  out  areas  where  new  research  is  necessary  to 
achieve  desired  goals. 

II. SPEECH CODING

In  the  new  emerging  digital  communication  environment, 
transmission  of  digital  speech  at  low  bit  rates  without  compromising 
voice  quality  is  becoming  increasingly  important.  Low  bit  rate  voice 
will  play  a  key  role  in  providing  new  capabilities  in  future 
communication  systems  —  e.g.  for  sending  voice  mail  over  telephone 
networks,  for  integrating  voice  and  data  in  packet  systems  for 
transmission,  for  narrow  band  cellular  radio,  and  for  insuring  privacy 
in  voice  communication. 

The  speech  coding  technology  to  achieve  high  voice  quality  is  well 
developed for bit rates as low as 16 kbits/sec [8]. The major research
effort is now focused on bringing the rate significantly lower than
16 kbits/sec without seriously degrading the speech quality. The
lower bit rates facilitate end-to-end digital voice communication over
dialed-up public telephone lines, and are important to spectrum
conservation in mobile radio.

Real-time implementation of low-bit-rate voice coders previously
has been a difficult and costly task. Recent advances in device
technology and the availability of fast programmable digital signal
processors [9] have made the task easier. We are now able to
implement fairly complex speech processing algorithms on a single
chip [10].

II.1 Present Speech Coding Technology

The objective in speech coding is to transform the analog speech
signal to a digital representation. Redundancies, introduced in the
speech signal during the human speech production process, make it
possible to encode speech at low bit rates. Moreover, our hearing
system is not equally sensitive to distortions at different frequencies
and has a limited dynamic range. Speech coding techniques take
advantage of these properties for reducing the bit rate. We can
summarize the present status of our capabilities for transmitting high
quality speech at low bit rates.

Figure 1 shows the variation of speech quality versus transmission
bit rate for three coder technologies. Typically, performance of
speech coders diminishes with decreasing transmission rate. In Fig. 1
the speech quality is expressed on a scale which includes the terms
excellent,  good,  fair  and  poor.  Often  speech  quality  is  expressed  in 
terms  of  a  Mean  Opinion  Score  (MOS)  on  a  5-point  scale  where  an 
MOS  of  5  is  excellent  quality,  4  is  good,  3  is  fair,  2  is  poor,  and  1  is 
unsatisfactory. 

BIT  RATE  VERSUS  SPEECH  QUALITY  FOR  SPEECH  CODERS 
The bit rate of 16 kbits/sec is suitable for a variety of applications,
such as voice mail, secure voice over wide-band cellular radio
channels, and integrated transmission of voice and data over packet
networks. The coding technology to achieve high quality at
16 kbits/sec is available at present. These coding techniques are more
complex in comparison to ones used in standard PCM and ADPCM
coders, but they can be implemented on a single digital signal
processor chip to perform real-time coding. Adaptive predictive
coders [11], sub-band coders with adaptive bit allocation [12], and
multi-pulse linear predictive coders [13] are a few examples of coders
capable of providing high quality speech at 16 kbits/sec. These
coders have been implemented on a single DSP chip and subjective
tests based on these implementations provide a mean opinion score
(MOS) of 3.9 for the multi-pulse coder, and 3.8 for the adaptive bit
allocation sub-band coder, both operating at 16 kbits/sec. For
comparison, standard mu-law PCM (56 kbps) and ADPCM
(32 kbps) coders have MOS of 4.5 and 4.0, respectively. Another
hybrid coder that combines the adaptive predictive and multi-pulse
coders has produced speech at 16 kbits/sec with quality exceeding
that of the ADPCM coder at 32 kbits/sec [14]. A multi-pulse coder
is capable of providing high quality speech at rates even lower than
16 kbits/sec, and details of this technique are discussed next.

II.2 Speech Synthesis Models for Low Bit Rate Coding

A proper speech synthesis model, capable of reproducing many
different voices and requiring a small amount of control information,
is essential for achieving high voice quality at low bit rates. A
synthesis model that has been popular over many years is the
traditional vocoder model where the synthetic speech is generated by
exciting a linear filter with pitch pulses or white noise. The
limitations of this simple vocoder model are now well known. The
multi-pulse LPC model [15] seeks to overcome such limitations by
replacing the traditional pitch pulse and white noise excitation with a
sequence of pulses whose amplitudes and locations are chosen to
minimize the perceptual difference between original and synthetic
speech signals. Figure 2 illustrates both the traditional vocoder and
the multi-pulse excitation models. The multi-pulse model has enough
flexibility to reproduce a wide variety of speech waveforms, including
voiced and unvoiced speech. The model is reasonably efficient in that
only a few pulses (typically 8 to 16 every 10 msec) are needed in the
multi-pulse excitation to produce high quality synthetic speech.
Further reduction in the number of pulses, in particular for
high-pitched voices, can be achieved by incorporating a linear filter with a
pitch loop in the synthesizer [13].


VOCODER  MODEL 


Fig.  1  Speech  quality  versus  bit  rate  for  different  types  of  coders. 

As shown in Fig. 1, two traditional coder technologies are
waveform coders and vocoders. Waveform coders aim at reproducing
the speech waveform as faithfully as possible. They provide high
quality speech above 16 kbits/sec but their performance usually falls
off rapidly at much lower bit rates. Vocoders use a model of human
speech production to obtain a more efficient representation of the
speech signal, and thus are able to bring the bit rate down to much
lower values (even as low as 400 bits/sec) but, with present
understanding, the speech quality is significantly impaired. Our
ability to provide high quality speech below 16 kbits/sec is limited at
present, but the next generation of coders, taking advantage of new
capabilities offered by VLSI technology as well as new understanding
in speech coding, promises to fill this gap in performance.

Speech coding methods have been standardized both at 64 and
32 kbits/sec and coders at these rates are being used in the public
switched telephone networks. There are no published civil standards
yet for lower bit rates, although there exists a military standard for a
2.4 kbits/sec vocoder.


[Figure 2 labels: EXCITATION, VOCAL-TRACT MODEL, SYNTHETIC SPEECH; panel title: MULTI-PULSE MODEL]

Fig. 2 The traditional vocoder and multi-pulse models for speech
synthesis.
Recently, another model, based on stochastic excitation [15], has
shown great promise for producing high quality speech at low bit
rates.  In  this  model,  the  excitation  is  selected  from  a  codebook  of 
random  white  Gaussian  sequences  using  a  fidelity  criterion  that 
minimizes  the  perceptual  difference  between  the  original  and  synthetic 
speech  signals.  The  different  synthesis  models  are  illustrated  in 
Fig.  3.  Both  multi-pulse  and  stochastic  models  use  identical  linear 
filters  to  introduce  correlations  at  short  and  long  delays  in  the  output 
speech  signal,  but  they  differ  mainly  in  the  manner  in  which  the 
excitation  to  the  linear  filter  is  specified. 

SPEECH SYNTHESIS MODELS
Fig. 3 Different speech synthesis models.


II.3 Excitation Models for Low Bit Rate Voice

The principle for determining the excitation in multi-pulse coders
is  illustrated  in  Fig.  4.  The  synthetic  speech  signal  at  the  output  of 
the  synthesis  filter  is  compared  with  the  original  speech  signal  and  the 
error  signal  is  further  processed  to  produce  a  measure  of  perceptual 
error.  This  processing  includes  linear  filtering  of  the  objective  error  to 
attenuate  those  frequencies  where  the  error  is  perceptually  less 
important  and  amplify  those  frequencies  where  the  error  is 
perceptually  more  important.  The  excitation  is  chosen  to  minimize 
the  perceptual  error. 


MULTI-PULSE EXCITATION ANALYSIS PROCEDURE
Fig. 4 Block diagram illustrating the procedure for determining
the optimum excitation in multi-pulse and stochastic coders.


The  locations  and  amplitudes  of  the  pulses  in  the  multi-pulse 
excitation  are  obtained  sequentially  —  one  pulse  at  a  time.  After  the 
first pulse has been determined, a new error is computed by
subtracting  out  the  contribution  of  the  first  pulse  to  the  error  and  the 
location  of  the  next  pulse  is  determined  by  finding  the  minimum  of 
the  new  error.  The  process  of  locating  new  pulses  is  continued  until 
the  error  is  reduced  to  acceptable  values  or  the  number  of  pulses 
reaches  the  maximum  value  that  can  be  encoded  at  the  specified  bit 
rate. The speech quality and the bit rate for the multi-pulse
excitation are determined by the number of pulses; 4 to 8 pulses in a
3  msec  frame  are  sufficient  for  producing  high  quality  speech. 
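
The one-pulse-at-a-time procedure can be sketched as a greedy search. The code below is an illustration under stated assumptions, not the coder of [15]: `target` is taken to be the speech frame after perceptual weighting, `h` is the impulse response of the (weighted) synthesis filter, each pulse gets its least-squares amplitude at the moment it is placed, and earlier amplitudes are not re-optimized. The names `shifted_response` and `multipulse_search` are hypothetical.

```python
def shifted_response(h, pos, n):
    """Filter response to a unit pulse at sample `pos`, truncated to n samples."""
    out = [0.0] * n
    for k, hk in enumerate(h):
        if pos + k < n:
            out[pos + k] = hk
    return out


def multipulse_search(target, h, max_pulses, tol=1e-9):
    """Place pulses sequentially; return [(location, amplitude), ...] and the
    remaining squared error."""
    n = len(target)
    error = list(target)
    pulses = []
    for _ in range(max_pulses):
        best = None
        for pos in range(n):
            s = shifted_response(h, pos, n)
            energy = sum(v * v for v in s)
            if energy == 0.0:
                continue
            corr = sum(e * v for e, v in zip(error, s))
            reduction = corr * corr / energy    # error removed by a pulse here
            if best is None or reduction > best[0]:
                best = (reduction, pos, corr / energy, s)
        if best is None or best[0] <= tol:
            break                               # error already acceptable
        _, pos, amp, s = best
        pulses.append((pos, amp))
        # Subtract the contribution of the new pulse to obtain the new error.
        error = [e - amp * v for e, v in zip(error, s)]
    return pulses, sum(e * e for e in error)


h = [1.0, 0.5, 0.25]
target = [0.0] * 20
for pos, amp in [(3, 2.0), (10, -1.0)]:
    for k, hk in enumerate(h):
        target[pos + k] += amp * hk

print(multipulse_search(target, h, max_pulses=8))
```

In this constructed case the two pulses that generated the target are recovered and the search stops early because the error has vanished. With real speech the loop would normally run until the pulse budget set by the bit rate is used up.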

II.4 High-Quality Speech Below 8 kbits/sec

Recent speech coding work using stochastically-excited linear
predictive coding has shown great promise for producing high quality
speech below 8 kbits/sec and possibly as low as 4.8 kbits/sec [16].
Such low rates are attractive for transmitting digital speech over
narrow band radio channels and for providing end-to-end digital
speech communication over ordinary dial-up public telephone lines.
The excitation in stochastic coders is determined by an exhaustive
search from a codebook of white Gaussian sequences to minimize the
perceptual  distortion  in  the  synthetic  speech.  The  search  procedure 
for  stochastic  excitation  is  illustrated  in  Fig.  5.  These  coders  are 
extremely  complex  and  require  more  than  50  million  multiply/add 
operations  per  second.  The  rapid  progress  in  custom  VLSI  circuits 
will  enable  us  to  handle  this  complexity  in  the  next  few  years.  The 
architecture  of  the  stochastic  coder  is  well  suited  for  VLSI 
implementation  since  the  search  procedure  carries  out  a  large  number 
of  simple  identical  operations,  namely,  the  computation  of  error  for 
each  member  of  the  codebook. 
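
The exhaustive search can be sketched in a few lines. The code below is a simplified illustration and makes several assumptions that are not in the text: the target is already perceptually weighted, the synthesis filter is represented by a short impulse response `h` with zero initial state, the long-delay (pitch) filter is omitted, and each code word is scaled by its least-squares gain. The function names and the codebook dimensions are hypothetical.

```python
import random


def synthesize(excitation, h):
    """Zero-state filter output for one excitation vector (truncated convolution)."""
    n = len(excitation)
    out = [0.0] * n
    for i, x in enumerate(excitation):
        for k, hk in enumerate(h):
            if i + k < n:
                out[i + k] += x * hk
    return out


def make_codebook(size, dim, seed=0):
    """Codebook of white Gaussian sequences, reproducible from a seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]


def search_codebook(target, codebook, h):
    """Try every code word; return (index, gain, squared error) of the best one."""
    target_energy = sum(t * t for t in target)
    best_index, best_gain, best_err = -1, 0.0, float("inf")
    for index, code in enumerate(codebook):
        y = synthesize(code, h)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        corr = sum(t * v for t, v in zip(target, y))
        err = target_energy - corr * corr / energy   # error at the optimal gain
        if err < best_err:
            best_index, best_gain, best_err = index, corr / energy, err
    return best_index, best_gain, best_err


codebook = make_codebook(size=64, dim=20)
h = [1.0, 0.6, 0.3]
target = [1.5 * v for v in synthesize(codebook[17], h)]
index, gain, err = search_codebook(target, codebook, h)
print(index, round(gain, 6))
```

Every pass through the loop performs the same filtering and correlation steps on a different code word, which is the regularity that makes the search attractive for parallel VLSI hardware.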

STOCHASTIC  CODING  OF  EXCITATION 
Fig. 5 Search procedure for determining the best stochastic code.


III. SPEECH SYNTHESIS

A principal objective in speech synthesis is to produce natural
quality synthetic speech from unrestricted text input. The goal is to
provide great versatility for having a machine speak information to a
human user, in as natural a manner as possible. Useful applications
of speech synthesis include announcement machines (e.g. weather,
time), computer answer back (voice messages, prompts), information
retrieval from databases (stock price quotations, bank balances),
reading aids for the blind, and speaking aids for the vocally
handicapped.

There are at least three major factors influencing the performance
of speech synthesizers. The first factor is the quality (or naturalness)
of the synthesis. It is often possible to trade between quality and
message flexibility. For example, simple announcement machines
often use the best speech coding methods to give high quality speech,
because the messages to be spoken are fixed in context and limited in
number. However, text-to-speech systems aim for great flexibility and
full versatility, and, in this case, the speech signal must be
synthesized from fundamental units.

A  second  factor  is  the  size  of  the  vocabulary.  If  a  relatively  small 
vocabulary  is  required,  (e.g.  100-500  words),  it  is  possible  to  custom- 
adjust  the  synthesis  for  improved  naturalness.  However,  for 
vocabularies  of  more  than  1000  words,  customized  tuning  is 
inappropriate. 

A third factor affecting speech synthesis is the cost (or
complexity) of the system. The cost includes hardware required for
storage of words, phrases, dictionaries and production rules, as well as
hardware required for speech signal generation (e.g. coder,
synthesizer). The cost of synthesis systems has fallen rapidly with
advances in VLSI, so this factor is becoming less of an issue.

III.1 Speech Synthesis from Stored Coded Speech

The easiest method of providing voice output for machines is to
create speech messages by concatenation of prerecorded and digitally
stored words, phrases, and sentences spoken by a human. There are
several trade-offs to be considered here. Using words as the basic
synthesis unit seems to be a proper choice, in many cases, because it
allows one to create a large number of utterances from a relatively
small number of words. However, the process of joining words and
creating sentences with the correct prosody is much more difficult.
This problem can be avoided by recording sentences, but the number
of sentences to be stored increases exponentially with the number of
words in a sentence. In order to reduce the storage requirement, a
variety of speech coding methods can be used. Simple speech coding
procedures can produce high quality speech at 32 kbits/sec. Speech
coding techniques, such as multi-pulse LPC, can bring the data rate
down to 10 kbits/sec. At this bit rate, approximately 100 sec of
speech data can be stored on a single 1 megabit ROM chip. MPLPC
is capable of producing high quality speech using a simple speech
synthesizer; most of the complexity of multi-pulse LPC is in the
speech analysis part that has to be done only once and does not need
real-time operation. Data rates as low as 1000 bits/sec can be
realized using LPC vocoding techniques but the speech quality is
much lower (the speech is intelligible but lacks naturalness) at these
low data rates.
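
The storage figures above follow from dividing capacity by bit rate, and the growth in the number of stored sentences is a power of the vocabulary size. The sketch below assumes that 1 megabit means 1,000,000 bits and that any word may follow any other (an upper bound that ignores grammar); the function names are hypothetical.

```python
def seconds_of_speech(storage_bits, bit_rate):
    """Playback time that fits in the given storage at a constant bit rate."""
    return storage_bits / bit_rate


def sentence_count(vocabulary_size, words_per_sentence):
    """Upper bound on distinct sentences of a fixed length."""
    return vocabulary_size ** words_per_sentence


ROM_BITS = 1_000_000   # assumed meaning of "1 megabit"

print(seconds_of_speech(ROM_BITS, 32_000))   # 31.25
print(seconds_of_speech(ROM_BITS, 10_000))   # 100.0
print(seconds_of_speech(ROM_BITS, 1_000))    # 1000.0
print(sentence_count(100, 5))                # 10000000000
```

Even a 100-word vocabulary allows up to ten billion five-word sequences, which is why storing whole sentences is practical only when the set of messages is small and fixed.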

The flexibility of stored-speech synthesis systems can be further
enhanced by allowing control of prosody (pitch and duration
adjustments) during the synthesis process. The MPLPC technique is
particularly suitable for providing the desired control of pitch and
duration. With the decreasing cost of digital storage, stored-speech
synthesis techniques could provide low cost voice output for many
applications.

Figure 6 shows a block diagram of a general concatenative type of
synthesis system. The storage consists of a fixed set of words, phrases,
and sentences which have been encoded. An input message, which is
a sequence of words, phrases, and sentences, is converted to the
appropriate sequence of units which are retrieved and concatenated
(usually with some type of smoothing at the junctions between units).
The concatenated units are sent to a decoder (synthesizer) and to a
digital-to-analog converter for transmission and/or playback. The
concatenative type of synthesis is used primarily in announcement
machines, and for applications such as automatic intercept of
incorrectly dialed telephone numbers, where only a small vocabulary
is required, and a limited set of output sentences is needed [17].

Fig. 6 Block diagram of a concatenation type of speech
synthesizer.

III.2 Text-to-Speech Synthesis

Stored-speech systems are not flexible enough to convert
unrestricted printed text to speech, which is the objective of text-to-speech
systems. Applications include accessing and speaking electronic mail,
reading machines for the blind, and automated directory assistance
systems that speak subscriber names and telephone numbers. A
text-to-speech system must be able to accept the incoming text (which
often includes abbreviations, Roman numerals, dates, times, formulas,
and a wide variety of punctuation marks) and convert it into a
speakable form. The text is translated into a phonetic transcription,
using a large pronouncing dictionary supplemented by appropriate
letter-to-sound rules. A stored library of about 2000 LPC (or
formant) coded speech segments spans the range of speech sounds of
a given language and provides the means for converting the phonetic
elements to spoken form. Thus the system is able to synthesize
virtually unrestricted sequences of phonemes. Speech waveforms are
finally generated from acoustic parameters using LPC or formant
synthesis. The resulting speech is intelligible and acceptable for a
variety of applications.

A block diagram of a text-to-speech synthesizer (TTS) is shown in
Fig. 7. In its most general form, the input to the system is a message
in the form of unrestricted ASCII text, and the output of the system
is the continuously spoken message. The system has three major
modules: letter-to-sound conversion, sound-to-parameter assembly, and
synthesis from a parametric description of the text. The
letter-to-sound conversion can utilize either a set of programmed pronouncing
rules or a stored pronouncing dictionary (which provides the phonetic
spelling of every word in the text message) or a mixture of these two
techniques. Even with dictionaries of several hundred thousand
words, there will be cases where the words of the ASCII text are not
always found (e.g. proper names, cities, specialized terminology), and
for such cases programmed pronunciation rules are mandatory. In
addition to deriving the phonetic symbols that correspond to the text
of the input message, the first module must also provide prosody
markers (pitch, duration, intensity) for the message to be spoken.
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Fig.  7  Block  diagram  of  a  text-to-speech  synthesizer  based  on 
synthesis  from  sub-word  units. 
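The mixture of dictionary lookup and programmed rules described for the first module can be sketched as follows. This is only an illustration: the dictionary entries, the phoneme symbols, the one-letter-per-sound rule table and the function names are assumptions made for the example, and real letter-to-sound rules are context dependent and much larger. Prosody markers are not modelled.

```python
# Toy pronouncing dictionary (illustrative entries and phoneme symbols only).
PRONOUNCING_DICT = {
    "speech": ["S", "P", "IY", "CH"],
    "text": ["T", "EH", "K", "S", "T"],
}

# Hypothetical one-letter-to-one-sound fallback rules; real rule sets are
# context dependent and far larger.
LETTER_RULES = dict(zip(
    "abcdefghijklmnopqrstuvwxyz",
    ["AE", "B", "K", "D", "EH", "F", "G", "HH", "IH", "JH", "K", "L", "M",
     "N", "AA", "P", "K", "R", "S", "T", "AH", "V", "W", "K", "Y", "Z"],
))


def letter_to_sound(word):
    """Return (phonemes, source): dictionary first, rules as the fallback."""
    key = word.lower()
    if key in PRONOUNCING_DICT:
        return list(PRONOUNCING_DICT[key]), "dictionary"
    return [LETTER_RULES[ch] for ch in key if ch in LETTER_RULES], "rules"


def transcribe(message):
    """Phonetic transcription of a whitespace-separated text message."""
    return [letter_to_sound(w.strip(".,;:!?"))[0] for w in message.split()]
```

A word found in the dictionary keeps its stored pronunciation, while an out-of-vocabulary word such as a proper name falls through to the rules.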

The second stage of the TTS system performs the conversion from
phonetic symbols to continuous synthesis parameters, based on the set
of sub-word units used to represent the speech. Thus if dyad units are
used, a conversion from phonetic symbols to dyads is required,
followed by retrieval and smoothing of the synthesis parameters
corresponding to the dyads in the message. Continuous contours for
pitch and timing are also computed in this stage. New work in
automatic parsing and syntax analysis is providing improved
capabilities in computing speech prosody. The final stage is the
synthesis of speech from the parametric representation of the
sub-word units. Typically an LPC, MPLPC, or formant synthesizer is
used [18,19].
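As a rough illustration of the second stage, the sketch below converts a phoneme string into dyads and smooths one retrieved parameter track. It assumes that a dyad can be represented as a pair of adjacent phonemes, with a hypothetical boundary symbol "#" for silence, and it uses a simple moving average as the smoothing rule. Neither choice is specified in the text.

```python
def phonemes_to_dyads(phonemes, boundary="#"):
    """Pair each phoneme with its successor, padding with a boundary symbol."""
    padded = [boundary] + list(phonemes) + [boundary]
    return [(padded[i], padded[i + 1]) for i in range(len(padded) - 1)]


def smooth(track, width=3):
    """Moving average over one synthesis parameter track (assumed smoother)."""
    half = width // 2
    out = []
    for i in range(len(track)):
        lo = max(0, i - half)
        hi = min(len(track), i + half + 1)
        out.append(sum(track[lo:hi]) / (hi - lo))
    return out
```

In a full system each dyad would index a stored segment of LPC or formant parameters, and the smoothing would be applied across the joins between retrieved segments.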

TTS synthesizers can be used for database access, such as stock
price quotations and bank balance checking, for access to voluminous
amounts of text material over telephone lines (e.g. medical or legal
encyclopedias), and as reading aids for the visually handicapped.
Current TTS synthesizers produce speech which approaches the word
intelligibility of natural speech, but the quality is typically synthetic
sounding. They perform with large vocabularies and great flexibility
and at relatively low cost. The challenge in speech synthesis over the
next several years is to improve voice quality and increase flexibility
by providing a range of voice styles (male, female, child), voice
characteristics (Southern drawl, New England accent, etc.), and
different languages. In this manner TTS systems can be tailored both
for the application and for the intended set of users.

Rapidly  advancing  VLSI  technology  will  have  a  large  impact  on 
future  speech  synthesis  technology.  Present  computer  models  of 
speech  synthesis  are  simple  in  comparison  to  human  speech 
generation,  and  it  is  not  yet  practical  to  implement  more  sophisticated 
synthesis  models.  But  future  advances  in  fundamental  understanding 
of  speech  production  and  language,  and  of  syntactic  and  semantic 
analysis, will contribute significantly to improved text-to-speech
synthesis. 

IV.  SPEECH  RECOGNITION 

Figure 8 shows a block diagram of the traditional, pattern
recognition based, speech recognition model. The input speech signal
can be anything from a single word (or a sequence of isolated words),
to a sentence of continuous speech. The first processing block is
feature measurement in which the speech signal is spectrally analyzed,
periodically in time, to give a series of spectral feature vectors
characteristic of the behavior of the speech signal. For the most part
we have used linear predictive coding (LPC) as the spectral
representation, but other spectral analyses like filter bank analysis are
equally suitable [20]. The time sequence of spectral features is called
a test pattern.


Fig. 8 Block diagram of a speech recognizer incorporating syntax
and semantics.

The second step in the processing is a pattern similarity
measurement in which the running set of spectral vectors (the test
pattern) is compared to a set of stored reference patterns, and for
each such comparison a distance or similarity score results. For the
most part we have used single words as the stored reference patterns,
but even in cases in which the basic recognition units are smaller
than words (e.g., syllables, demisyllables, dyads, phonemes, etc.), a
lexicon can be used to build word reference patterns and so we are
equivalently using words as the recognition unit. The pattern
similarity measurement typically involves time registration of the
stored reference pattern (which consists of a series of feature vectors)
with the running speech (which is also a series of feature vectors).
The technique of dynamic time warping (DTW) is generally used to
provide the optimal alignment between reference and test (speech)
patterns [21].

The  basic  procedures  of  time  alignment  are  illustrated  in  Fig.  9 
which  shows  representative  contours  of  a  test  and  reference  pattern 
(the  lengths  of  both  patterns  have  been  made  equal  here;  in  general 
they are different and this difference must be accounted for). It can
be  seen  that  distinctive  events  in  the  two  patterns  (i.e.,  peaks  in  the 
contours)  do  not  occur  at  the  same  time  instants.  Thus  the  purpose  of 
the  DTW  alignment  procedure  is  to  derive  an  optimal  time  alignment 
between  test  and  reference  patterns  by  locally  shrinking  or  expanding 
the  time  axis  of  one  of  the  patterns  to  optimally  match  the  other 
pattern.  An  efficient  mathematical  procedure  exists  for  obtaining  an 
optimal  alignment  curve  based  on  dynamic  programming  techniques 
[21].  The  alignment  curve  for  the  examples  of  Fig.  9  is  shown  at  the 
upper  right  of  this  figure.  The  similarity  (or  equivalently  distance) 
between  a  reference  and  test  pattern  is  defined  as  the  normalized  sum 
of  the  spectral  similarities  (distances),  along  the  discrete  set  of  points 
in  the  optimal  time  alignment  path,  between  reference  and  test 
patterns. 


Fig.  9  Illustration  of  time  alignment  between  a  test  and  reference 
pattern. 
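The time alignment procedure described above can be sketched with a basic dynamic programming recursion. This is a minimal version under stated assumptions: feature vectors are plain lists of numbers, the local distance is Euclidean, the only local path moves are the three unit steps, and the accumulated distance is normalized by the path length. The text does not fix any of these choices, and practical recognizers add slope constraints and endpoint relaxation.

```python
import math


def euclidean(a, b):
    """Local spectral distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(test, ref, dist=euclidean):
    """Return (normalized distance, alignment path) between two patterns."""
    n, m = len(test), len(ref)
    inf = float("inf")
    D = [[inf] * m for _ in range(n)]        # accumulated distances
    back = [[None] * m for _ in range(n)]    # best predecessor of each cell
    D[0][0] = dist(test[0], ref[0])
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                continue
            best, arg = inf, None
            # Allowed moves: advance test only, reference only, or both.
            for pi, pj in ((i - 1, j), (i, j - 1), (i - 1, j - 1)):
                if pi >= 0 and pj >= 0 and D[pi][pj] < best:
                    best, arg = D[pi][pj], (pi, pj)
            D[i][j] = dist(test[i], ref[j]) + best
            back[i][j] = arg
    # Trace the optimal alignment curve back from the end point.
    path = [(n - 1, m - 1)]
    while back[path[-1][0]][path[-1][1]] is not None:
        path.append(back[path[-1][0]][path[-1][1]])
    path.reverse()
    return D[n - 1][m - 1] / len(path), path
```

For a test pattern that is a time-stretched copy of the reference, the returned distance is zero and the path shows which reference frame was repeated.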

The  third  step  in  the  processing  of  Fig.  8  is  syntax  control  which 
uses  task  syntax  to  determine  the  proper  sequencing  of  stored 
reference  patterns  (words)  for  the  task  at  hand.  The  syntax  control 
could,  in  theory,  also  exercise  control  over  the  feature  measurement 
algorithm,  thereby  changing  the  type  (and/or  form)  of  analysis 
depending on the sound to be recognized. Such sophisticated control
has  not  been  used  in  current  speech  recognizers. 

The  last  step  in  processing  of  Fig.  8  is  a  semantic  processor  which 
chooses,  as  the  recognized  speech,  the  sentence  (or  word)  which  has 
both  the  smallest  distance  (or  highest  similarity)  to  the  input  speech, 
and  which  is  semantically  meaningful  (given  that  it  has  already  been 
checked  for  syntax). 

IV.1 HMM Models

An alternative to using templates to characterize words (or
sub-word units) is to build probabilistic models which describe,
statistically, the time-varying spectral characteristics of the word.
One very popular form of such probabilistic models is the hidden
Markov model (HMM) [7], an example of which is shown in Fig. 10.
This model has N states (5 in the example shown) and each state
physically corresponds (in some vague sense) to a set of temporal
events in the speech sound. The overall HMM is characterized by a
state transition matrix, A, (which describes how new states may be
reached from old states) and by a statistical characterization of the
acoustic vectors, B, (the analysis feature vectors, x) within the state.

HIDDEN MARKOV MODEL (LEFT-TO-RIGHT)

A = [a_ij] = PROB (STATE j | STATE i)

B = [b_j(x)] = PROB (ANALYSIS VECTOR x | STATE j)

Fig. 10 A hidden Markov model (HMM) suitable for representing
a single word.

The  changes,  to  the  recognition  structure  of  Fig.  8,  required 
because  of  using  HMM’s  rather  than  templates  are  minimal.  The 
store  of  template  reference  patterns  is  replaced  by  a  store  of  reference 
models,  and  the  pattern  similarity  algorithm  uses  statistical  scoring 
instead  of  distances  and  uses  a  somewhat  different  alignment 
procedure  to  line  up  states  of  the  reference  model  to  frames  of  the  test 
pattern. 
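The statistical scoring and state-to-frame alignment mentioned above are commonly done with the Viterbi algorithm. The sketch below is one such illustration, not necessarily the procedure the authors used. It assumes discrete observation symbols (the text describes continuous analysis vectors x), a model that starts in the first state and must end in the last state, and log probabilities to avoid underflow. The function names and the `models` layout are hypothetical.

```python
import math


def log(p):
    """Natural log that maps a zero probability to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi_score(obs, A, B):
    """Best-path log likelihood and state sequence for one word model.

    A[i][j] is PROB(state j | state i); B[j][o] is PROB(symbol o | state j).
    """
    n = len(A)
    delta = [float("-inf")] * n
    delta[0] = log(B[0][obs[0]])             # left-to-right: start in state 0
    back = []
    for o in obs[1:]:
        new, ptr = [0.0] * n, [0] * n
        for j in range(n):
            best_i = max(range(n), key=lambda i: delta[i] + log(A[i][j]))
            new[j] = delta[best_i] + log(A[best_i][j]) + log(B[j][o])
            ptr[j] = best_i
        delta = new
        back.append(ptr)
    state = n - 1                            # require ending in the last state
    score, states = delta[state], [state]
    for ptr in reversed(back):
        state = ptr[state]
        states.append(state)
    states.reverse()
    return score, states


def recognize(obs, models):
    """Pick the word whose model (A, B) gives the highest Viterbi score."""
    return max(models, key=lambda w: viterbi_score(obs, *models[w])[0])
```

The returned state sequence plays the role of the DTW alignment path: it says which frames of the test pattern were assigned to which state of the reference model.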

IV.2 Performance Results - Isolated Words

For isolated word recognition, the classic technique has been to
build templates or statistical models based on natural spoken
occurrences of the word. In the simplest case a word reference
pattern is created directly from one or more spoken occurrences of the
word by a given talker. In a more sophisticated application, a set of
multiple occurrences of the word is clustered to give one (or more)
word reference patterns. The patterns may be talker specific (the
so-called speaker dependent (SD) recognizers), or speaker independent
(SI), depending on the way they are derived. The vocabulary sizes
for which isolated word systems have been tested range from a few
words (e.g., 10 digits) up to over 1000 words (e.g., 1109 words of
Basic English). Table 1 summarizes the current performance for a
range of vocabularies, for both SD and SI cases. It can readily be
seen that the complexity of the words in the vocabulary (i.e., how
similar are the nearest sounding word pairs) is more important than
mere vocabulary size.

Table 1. Performance of Isolated Word Recognizers as a Function of
Vocabulary Size
[Only fragments of the table body survive: 99.2%; SI; 54 Computer Terms; 96.5%; 88%; 91%; SD; 79.2%]

IV.3 Connected Word Recognition

A  somewhat  more  complicated  task  in  speech  recognition  is  that  of 
recognizing  speech  which  is  nominally  spoken  as  a  connected  word 
string,  e.g.,  digit  strings  for  dialing  telephone  numbers,  letter  strings 
for  spelling  names,  etc.  The  manner  in  which  such  strings  are 
recognized, based on the statistical pattern recognition approach, is
illustrated in Fig. 11. We assume each word in the vocabulary is
represented by one or more reference patterns (i.e., templates or
statistical models) and that the unknown spoken word string can be
recognized by finding the best concatenation of reference patterns
which matches the test pattern. There are several problems associated
with trying to find the optimal sequence of reference patterns to
match the unknown test pattern, including:

1.  The  number  of  words  in  the  test  pattern  is  generally  unknown. 

2. The locations, in time, of the boundaries between words are
unknown; in fact there really are no well defined boundaries in
many cases since the end of one word often merges smoothly
with the beginning of the following word.

3.  Matches  between  reference  and  test  patterns  are  generally  poor 
at  the  beginnings  and  ends  of  reference  patterns  because  of  the 
high  degree  of  variability. 

4. Matching strings exhaustively (i.e., by trying
all combinations, of all lengths, of all reference patterns) is
prohibitively expensive.


CONNECTED WORD RECOGNITION FROM WORD TEMPLATES


Fig.  11  Illustration  of  the  problems  associated  with  recognizing  a 
connected  word  string  from  single  word  reference  patterns. 

Fortunately, several algorithms have been devised which optimally
solve the matching problem without an exponential growth in
computation as the vocabulary or size of the string grows [22-25].
One such algorithm is the level building (LB) procedure, which allows
the recognition processing to proceed in a series of levels (words) to
determine the best connected word string match for every permissible
string length. Thus solutions to problems 1 and 4, above, have been
found. No perfect solution to problems 2 and 3 is known. However,
a reasonable approach, and one which has worked quite well to date,
is to extract word reference patterns from tokens obtained from
connected word strings. Thus the reference patterns for digits, for
example, are obtained from analysis of a training set of connected
digit strings; hence each reference pattern has information about the
spectral dynamics of digits in strings, rather than in isolation.
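The idea of proceeding level by level, keeping the best string of each length, can be illustrated with the simplified sketch below. It is not the published LB algorithm: it assumes scalar features, an unnormalized DTW distance, and it recomputes a DTW match for every candidate word segment, which is far less efficient than the real procedure. It only shows how a dynamic program over levels avoids enumerating all word strings while still returning the best match for every string length.

```python
def dtw_cost(segment, ref):
    """Accumulated DTW distance between two sequences of scalar features."""
    inf = float("inf")
    n, m = len(segment), len(ref)
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(segment[i - 1] - ref[j - 1])
            D[i][j] = d + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


def level_building(test, refs, max_words):
    """Return {string length: (total distance, word list)} for the test."""
    T = len(test)
    # levels[l][e]: best l-word string covering frames 0 .. e-1
    levels = [{0: (0.0, [])}]
    for _ in range(max_words):
        cur = {}
        for start, (cost, words) in levels[-1].items():
            for end in range(start + 1, T + 1):
                for word, ref in refs.items():
                    c = cost + dtw_cost(test[start:end], ref)
                    if end not in cur or c < cur[end][0]:
                        cur[end] = (c, words + [word])
        levels.append(cur)
    return {l: levels[l][T] for l in range(1, max_words + 1) if T in levels[l]}
```

Because each level stores only the best partial string per ending frame, the work grows linearly with the number of levels rather than exponentially with string length. The final choice among string lengths is made by comparing the returned distances.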

IV.4 Performance Results - Connected Words

Table 2 summarizes current performance of connected word
recognizers based on the LB algorithm. For a digits vocabulary, in a
speaker trained mode, string accuracies greater than 98% have been
obtained for unknown length strings. In a speaker independent mode,
the best string accuracy has been only about 90%. Results are also
given in Table 2 for connected letter recognition of spelled names
from a 17,000 name directory, and for an airlines reservation and
information task based on a vocabulary of 127 words.

UL is for Unknown Length Strings

KL is for Known Length Strings

IV.5 Continuous Speech Recognition

Based on experience with more limited speech recognition tasks,
work has begun on building a large vocabulary (1000-20,000 word),
natural syntax (i.e., approaching that of spoken English) continuous
speech recognition system. A block diagram of the proposed system
architecture is given in Fig. 12. The similarity of Fig. 12 to Fig. 8
should be obvious to the reader. The major complications in building
such a recognizer are the following:

1. Words cannot be the basic unit for recognition; instead sub-word
units must be used. Possible sub-word units include syllables,
demi-syllables, diphones, dyads, phonemes, etc.

2. A lexicon must be used which describes how words are made up
from the sub-word units. The lexicon can be an explicit
representation (e.g., a dictionary of pronunciations from
sub-word units), or it can be probabilistic in nature.

3. A language model is used to describe the constraints among
words in the language. The language model could be a formal
grammar, a statistical model, or even an explicit state diagram
of task syntax as used in Fig. 8.

Fig. 12 Model for large vocabulary speech recognition based on
sub-word speech units.

Each  of  the  complications  listed  above  is  formidable  and  leads  to  a 
wide  range  of  choices  of  how  to  handle  the  problem.  Taken  together 
they  give  an  idea  as  to  why  continuous  speech  recognition  is,  and  will 
remain,  an  unsolved  problem  for  a  long  time. 

V.  SPEAKER  RECOGNITION 

The speaker recognition problem is really a pair of problems:
speaker identification, in which a talker is identified as one of
a given set of talkers, and speaker verification, in which the talker
gives both a claimed identity and a transaction request, and the
system decides whether to accept or reject the identity claim. It
should be clear that speaker identification is a much harder problem
than speaker verification, since, as the number of speakers increases
without bound, the probability of error goes to 1 in identifying a
talker, whereas the probability of error remains fairly constant for
speaker verification.
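A back-of-envelope model makes the limiting behavior concrete. Suppose, purely as an assumption for illustration, that each comparison against a wrong talker independently has probability p of scoring better than the true talker. Identification among N talkers then fails with probability 1 - (1 - p)^(N - 1), which tends to 1 as N grows, whereas verification involves a single comparison regardless of N. The function name and the independence assumption are not from the text.

```python
def identification_error(p_confuse, n_talkers):
    """Error probability for N-way identification under independent confusions."""
    return 1.0 - (1.0 - p_confuse) ** (n_talkers - 1)


# With a 1% pairwise confusion rate the identification error grows with the
# population size; a verification decision would stay near the single-
# comparison error rate for any population size.
errors = {n: identification_error(0.01, n) for n in (2, 10, 100, 1000)}
```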

Figure 13 shows a block diagram of a speaker recognition system.
The input speech, which can be either a sentence or a sequence of
words (e.g., digits), is first spectrally analyzed, and then the resulting
spectral pattern is compared to stored reference patterns, using DTW
methods. For speaker identification, the pattern similarity processing
must be performed for each assumed talker (i.e., for the entire set of
talkers), and the decision box chooses the identified talker as the one
with the highest similarity to the input speech. For speaker
verification, the pattern similarity processing is only performed for the
claimed identity, i.e., only a single distance score results. Based on
the transaction requested and the similarity score of the DTW
processing, the decision box decides whether to accept or reject the
claimed identity. Thus, for a banking transaction, a lower degree of
similarity would be required to deposit money into an account than to
withdraw money from the account.


Fig.  13  Block  diagram  of  a  speaker  verification  system. 
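The two decision rules can be sketched as follows, working with DTW distances rather than similarities, so that a lower similarity requirement becomes a larger permitted distance. The threshold values and the names `THRESHOLDS`, `verify` and `identify` are hypothetical. A real system would set its thresholds from measured false-acceptance and false-rejection rates.

```python
# Hypothetical maximum acceptable DTW distance for each transaction type.
# Depositing tolerates a poorer match than withdrawing.
THRESHOLDS = {"deposit": 0.8, "withdraw": 0.4}


def verify(distance, transaction, thresholds=THRESHOLDS):
    """Accept the claimed identity if the single distance score is small enough."""
    return distance <= thresholds[transaction]


def identify(distances):
    """Choose the talker whose reference pattern is closest to the input."""
    return min(distances, key=distances.get)
```

Here `verify` needs only the score against the claimed talker, while `identify` needs one score per talker in the population.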

It can be seen, by comparing Figs. 8 and 13, that the processing
for speech and speaker recognition is quite similar. Thus, as
fundamental improvements are made in any of the basic procedures
(feature measurement, pattern similarity, etc.), the performance of
both types of systems improves.

Key factors affecting the performance of speaker verification
systems are the type of input string which is used, the features used to
characterize the voice pattern, and the type of transmission system
over which the verification system is used. Best performance is
achieved when sentence-long utterances are used in a relatively
noise-free speaking environment. Conversely, poorer performance is
achieved for short, unconstrained, spoken utterances, in a noisy
environment. Table 3 summarizes current performance of several
types of speaker verification systems [26-27]. Current research in
speaker recognition aims to improve performance by adapting the
talker patterns, over time, to track changes in voice patterns.

VI. CONCLUDING COMMENT

This  overview  of  digital  speech  processing  has  aimed  to  highlight 
recent  advances,  current  areas  of  research,  and  key  issues  for  which 
new  fundamental  understanding  of  speech  is  needed.  Future  progress 
in speech processing will surely be linked closely with advances in
computation, microelectronics and algorithm design.

ABSTRACT


Algorithms for speech recognition can be characterized broadly as pattern recognition approaches and acoustic
phonetic  approaches.  To  date,  the  greatest  degree  of  success  in  speech  recognition  has  been  obtained  using 
pattern  recognition  paradigms.  Thus,  in  this  paper,  we  will  be  concerned  primarily  with  showing  how  pattern 
recognition  techniques  have  been  applied  to  the  problems  of  isolated  word  (or  discrete  utterance)  recognition, 
connected  word  recognition,  and  continuous  speech  recognition.  We  will  show  that  our  understanding  (and 
consequently  the  resulting  recognizer  performance)  is  best  for  the  simplest  recognition  tasks  and  is  considerably 
less  well  developed  for  large  scale  recognition  systems. 

I.  Introduction 

The ultimate goal of most research in speech recognition is to develop a machine that has the ability to
understand  fluent,  conversational  speech,  with  unrestricted  vocabulary,  from  essentially  any  talker.  Although  the 
promise  of  such  a  capable  machine  is  as  yet  unfulfilled,  the  field  of  automatic  speech  recognition  has  made 
significant  advances  in  the  past  decade  [1-3].  This  is  due,  in  part,  to  the  great  advances  made  in  VLSI 
technology,  which  have  greatly  lowered  the  cost  and  increased  the  capability  of  individual  devices  (e.g. 
processors,  memory),  and  in  part  due  to  the  theoretical  advances  in  our  understanding  of  how  to  apply  powerful 
mathematical  modelling  techniques  to  the  problems  of  speech  recognition. 

When  setting  out  to  define  the  problems  associated  with  implementing  a  speech  recognition  system,  one  finds 
that  there  are  a  number  of  general  issues  that  must  be  resolved  before  designing  and  building  the  system.  One 
such  issue  is  the  size  and  complexity  of  the  user  vocabulary.  Although  useful  recognition  systems  have  been  built 
with  as  few  as  two  words  (yes,  no),  there  are  at  least  four  distinct  ranges  of  vocabulary  size  of  interest.  Very 
small  vocabularies  (on  the  order  of  10  words)  are  most  useful  for  control  tasks  -  e.g.  all  digit  dialing  of  telephone 
numbers,  repertory  name  dialing,  access  control  etc.  Generally  the  vocabulary  words  are  chosen  to  be  highly 
distinctive  words  (i.e.  of  low  complexity)  to  minimize  potential  confusions.  The  next  range  of  vocabulary  size  is 
moderate  vocabulary  systems  having  on  the  order  of  100  words.  Typical  applications  include  spoken  computer 
languages,  voice  editors,  information  retrieval  from  databases,  controlled  access  via  spelling  etc.  For  such 
applications,  the  vocabulary  is  generally  fairly  complex  (i.e.  not  all  pairs  of  words  are  highly  distinctive),  but 
word  confusions  are  often  resolved  by  the  syntax  of  the  specific  task  to  which  the  recognizer  is  applied.  The  third 
vocabulary  range  of  interest  is  the  large  vocabulary  system  with  vocabulary  sizes  on  the  order  of  1000  words. 
Vocabulary  sizes  this  large  are  big  enough  to  specify  fairly  comfortable  subsets  of  English  and  hence  are  used  for 
conversational types of applications - e.g. the IBM laser patent text, basic English, etc. [4,5]. Such vocabularies
are  inherently  very  complex  and  rely  heavily  on  task  syntax  to  resolve  recognition  ambiguities  between  similar 
sounding  words.  Finally  the  last  range  of  vocabulary  size  is  the  very  large  vocabulary  system  with  10,000  words 
or  more.  Such  large  vocabulary  sizes  are  required  for  office  dictation/word  processing  and  language  translation 
applications. 

Although vocabulary size and complexity are of paramount importance in specifying a speech recognition
system,  several  other  issues  can  also  greatly  affect  the  performance  of  a  speech  recognizer.  The  system  designer 
must  decide  if  the  system  is  to  be  speaker  trained,  or  speaker  independent;  the  format  for  talking  must  be  specified 
(e.g.  isolated  inputs,  connected  inputs,  continuous  discourse);  the  amount  and  type  of  syntactic  and  semantic 
information  must  be  specified;  the  speaking  environment  and  transmission  conditions  must  be  considered;  etc. 
The  above  set  of  issues,  by  no  means  exhaustive,  gives  some  idea  as  to  how  complicated  it  can  be  to  talk  about 
speech  recognition  by  machine. 

There  are  two  general  approaches  to  speech  recognition  by  machine,  the  statistical  pattern  recognition 
approach,  and  the  acoustic-phonetic  approach.  The  statistical  pattern  recognition  approach  is  based  on  the 
philosophy  that  if  the  system  has  “seen  the  pattern,  or  something  close  enough  to  it,  before,  it  can  recognize  it.” 
Thus, a fundamental element of the statistical pattern recognition approach is pattern training. The units being
trained,  be  they  phrases,  words,  or  sub-word  units,  are  essentially  irrelevant,  so  long  as  a  good  training  set  is 
available,  and  a  good  pattern  recognition  model  is  applied.  On  the  other  hand,  the  acoustic-phonetic  approach  to 
speech  recognition  has  the  philosophy  that  speech  sounds  have  certain  invariant  (acoustic)  properties,  and  that  if 
one  could  only  discover  these  invariant  properties,  continuous  speech  could  be  decoded  in  a  sequential  manner 
(perhaps  with  delays  of  several  sounds).  Thus,  the  basic  techniques  of  the  acoustic-phonetic  approach  to  speech 
recognition are feature analysis (i.e. measurement of the invariants of sounds), segmentation of the feature
contours  into  consistent  groups  of  features,  and  labelling  of  the  segmented  features  so  as  to  detect  words, 
sentences,  etc. 

To date, the greatest successes in speech recognition have been achieved using the pattern recognition approach.
Hence, for the remainder of this paper, we will restrict our attention to trying to explain how the model works,
and how it has been applied to the problems of isolated word, connected word, and continuous speech recognition.


II.  The  Statistical  Pattern  Recognition  Model 

Figure 1 shows a block diagram of the pattern recognition model used for speech recognition. The input
speech signal, s(n), is analyzed (based on some parametric model) to give the test pattern, T, and then compared
to a prestored set of reference patterns, {R_v}, 1 ≤ v ≤ V (corresponding to the V labelled patterns in the system),
using a pattern classifier (i.e. a similarity procedure). The pattern similarity scores are then sent to a decision
algorithm which, based upon the syntax and/or semantics of the task, chooses the best transcription of the input
speech.

Figure 1. Pattern Recognition Model for Speech Recognition.
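The flow of Figure 1 can be summarized in a few lines. In this sketch the pattern classifier is any caller-supplied distance function, and the decision algorithm is reduced to picking the smallest distance among the labels that the task syntax currently allows. The names `classify`, `allowed` and `city_block` are illustrative assumptions, and semantic processing is not modelled.

```python
def city_block(a, b):
    """Stand-in distance for equal-length patterns (assumed, for illustration)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify(test, references, distance, allowed=None):
    """Compare test pattern T with each labelled reference R_v and decide.

    references maps a label to its reference pattern; allowed optionally
    restricts the candidates to those permitted by the task syntax.
    """
    scores = {label: distance(test, ref)
              for label, ref in references.items()
              if allowed is None or label in allowed}
    best = min(scores, key=scores.get)
    return best, scores
```

Swapping `city_block` for a DTW distance or an HMM score changes the classifier without touching the rest of the structure, which is one sense in which the model does not depend on the choice of features or similarity algorithm.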

There  are  two  types  of  reference  patterns  which  can  be  used  with  the  model  of  Fig.  1.  The  first  type,  called 
nonparametric  reference  patterns,  are  patterns  created  from  one  or  more  real  world  tokens  of  the  actual  pattern. 
The  second  type,  called  statistical  reference  models,  are  created  as  a  statistical  characterization  (via  a  fixed  type  of 
model) of the behavior of a collection of real world tokens. Ordinary template approaches [6] are examples of
the  first  type  of  reference  patterns;  hidden  Markov  models  [7,8]  are  examples  of  the  second  type  of  reference 
patterns. 

The  model  of  Fig.  1  has  been  used  (either  explicitly  or  implicitly)  for  almost  all  commercial  and  industrial 
speech  recognition  systems  for  the  following  reasons: 

1.  it  is  invariant  to  different  speech  vocabularies,  users,  feature  sets,  pattern  similarity  algorithms,  and 
decision  rules 

2.  it  is  easy  to  implement  in  either  software  or  hardware 

3.  it  works  well  in  practice. 

For  all  of  these  reasons  we  will  concentrate  on  this  model  throughout  this  paper.  In  the  remainder  of  this  paper 
we  will  discuss  the  elements  of  the  pattern  recognition  model  and  show  how  it  has  been  used  for  isolated  word, 
connected  word,  and  for  continuous  speech  recognition.  Because  of  the  tutorial  nature  of  this  paper  we  will 
minimize  the  use  of  mathematics  in  describing  the  various  aspects  of  the  signal  processing.  The  interested  reader 
is referred to the appropriate references (e.g. [6-14]).

II.1 Parametric Representation

Parametric  representation  (or  feature  measurement,  as  it  is  often  called)  is  basically  a  data  reduction  technique 
whereby  a  large  number  of  data  points  (in  this  case  samples  of  the  speech  waveform  recorded  at  an  appropriate 
sampling  rate)  are  transformed  into  a  smaller  set  of  features  which  are  equivalent  in  the  sense  that  they  faithfully 
describe  the  salient  properties  of  the  acoustic  waveform.  For  speech  signals,  data  reduction  rates  from  10  to  100 
are  generally  practical. 

For  representing  speech  signals,  a  number  of  different  feature  sets  have  been  proposed  ranging  from  simple 
sets,  such  as  energy  and  zero  crossing  rates  (usually  in  selected  frequency  bands),  to  complex,  complete 
representations,  such  as  the  short-time  spectrum  or  a  linear  predictive  coding  (LPC)  model.  For  recognition 
systems,  the  motivation  for  choosing  one  feature  set  over  another  is  often  complex  and  highly  dependent  on 
constraints imposed on the system (e.g. cost, speed, response time, computational complexity, etc.). Of course the
ultimate  criterion  is  overall  system  performance  (i.e.  accuracy  with  which  the  recognition  task  is  performed). 
However,  this  criterion  is  also  a  complicated  function  of  all  system  variables. 

The  two  most  popular  parametric  representations  for  speech  recognition  are  the  short-time  spectrum  analysis 
(or  bank  of  filters)  model,  and  the  LPC  model.  The  bank  of  filters  model  is  illustrated  in  Figure  2.  The  speech 
signal is passed through a bank of Q bandpass filters covering the speech band from 100 Hz to some upper cutoff
frequency (typically between 3000 and 8000 Hz). The number of bandpass filters used varies from as few as 5 to
as many as 32. The filters may or may not overlap in frequency. Typical filter spacings are linear until about
1000 Hz and logarithmic beyond 1000 Hz [9].

Figure 2. Bank of Filters Analysis Model.
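One way to lay out such a filter bank is sketched below. It assumes that the spacing rule applies to filter center frequencies, that the linear section runs from 100 Hz to exactly 1000 Hz, and that the logarithmic section uses a constant frequency ratio up to the chosen upper cutoff. The function name and default values are illustrative only.

```python
def center_frequencies(n_linear, n_log, f_low=100.0, f_break=1000.0,
                       f_high=4000.0):
    """Center frequencies: linear spacing up to f_break, logarithmic beyond."""
    step = (f_break - f_low) / n_linear
    linear = [f_low + step * k for k in range(n_linear + 1)]  # ends at f_break
    ratio = (f_high / f_break) ** (1.0 / n_log)
    log_part = [f_break * ratio ** k for k in range(1, n_log + 1)]
    return linear + log_part
```

For example, `center_frequencies(9, 2)` gives ten filters spaced 100 Hz apart up to 1000 Hz, followed by filters at about 2000 Hz and 4000 Hz, for Q = 12 in total.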

The  output  of  each  bandpass  filter  is  generally  passed  through  a  nonlinearity  (e.g.  a  square  law  detector  or  a 
full wave rectifier) and lowpass filtered (using a filter of 20-30 Hz bandwidth) to give a signal which is proportional to
the  energy  of  the  speech  signal  in  the  band.  A  logarithmic  compressor  is  generally  used  to  reduce  the  dynamic 
range  of  the  intensity  signal,  and  the  compressed  output  is  resampled  (decimated)  at  a  low  rate  (generally  twice 
the  lowpass  filter  cutoff)  for  efficiency  of  storage. 
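
The per-channel processing described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of Figure 2: the bandpass filtering is assumed to have been done already, the nonlinearity is a square-law detector, and a simple one-pole smoother stands in for the 20-30 Hz lowpass filter. The function name, the 25 Hz cutoff and the small floor inside the logarithm are hypothetical choices.

```python
import math

def channel_log_energy(band_signal, fs, lp_cutoff_hz=25.0, floor=1e-10):
    """Square-law detect, smooth, log-compress and decimate one filter-bank channel."""
    # One-pole lowpass coefficient (assumed stand-in for the 20-30 Hz lowpass filter).
    alpha = 1.0 - math.exp(-2.0 * math.pi * lp_cutoff_hz / fs)
    # Resample at twice the lowpass cutoff, as described in the text.
    step = max(1, int(round(fs / (2.0 * lp_cutoff_hz))))
    smoothed = 0.0
    features = []
    for n, x in enumerate(band_signal):
        energy = x * x                            # square-law detector
        smoothed += alpha * (energy - smoothed)   # lowpass smoothing
        if (n + 1) % step == 0:                   # decimation
            # logarithmic compression, expressed in dB
            features.append(10.0 * math.log10(smoothed + floor))
    return features

fs = 8000
# One second of a unit-amplitude 1000 Hz tone as a stand-in for one channel's output.
tone = [math.sin(2.0 * math.pi * 1000.0 * n / fs) for n in range(fs)]
track = channel_log_energy(tone, fs)
```

For the unit-amplitude tone the smoothed energy settles near 0.5, i.e. about -3 dB, and the 8000 input samples are reduced to 50 feature values per second.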

The  LPC  feature  model  for  recognition  is  shown  in  Figure  3.  Unlike  the  bank  of  filters  model,  this  system  is 
a block processing model in which a frame of N samples of speech is processed, and a vector of features is
computed. The steps involved in obtaining the vector of LPC coefficients, for a given frame of N speech samples,
are as follows:

1. preemphasis by a first order digital network in order to spectrally flatten the speech signal.

2. frame windowing, i.e. multiplying the N speech samples within the frame by an N-point Hamming
window, so as to minimize the endpoint effects of chopping an N-sample section out of the speech signal.

3. autocorrelation analysis in which the windowed set of speech samples is autocorrelated to give a set of
(p + 1) coefficients, where p is the order of the desired LPC analysis (typically 8 to 12).

4. LPC analysis in which the vector of LPC coefficients is computed from the autocorrelation vector using a
Levinson or a Durbin recursive method [10].

New  speech  frames  are  created  by  shifting  the  analysis  window  by  M  samples  (typically  M  <  N)  and  the  above 
steps  are  repeated  on  the  new  frame  until  the  entire  speech  signal  has  been  analyzed. 
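
The four steps and the frame shifting can be sketched in a few lines. This is an illustrative sketch only: the preemphasis coefficient of 0.95, the function names, and the sign convention (the predictor is s[n] ≈ a_1 s[n-1] + ... + a_p s[n-p]) are assumptions, not values given in the text.

```python
import math

def levinson_durbin(r, p):
    """Compute p predictor coefficients from autocorrelations r[0..p]."""
    a = [0.0] * (p + 1)            # a[1..p] hold the predictor coefficients
    err = r[0]                     # prediction error energy
    for i in range(1, p + 1):
        if err <= 0.0:             # silent or degenerate frame: stop early
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err              # reflection coefficient
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= (1.0 - k * k)
    return a[1:], err

def lpc_features(signal, N, M, p, preemph=0.95):
    """Return one vector of p LPC coefficients per N-sample frame, shifted by M samples."""
    # Step 1: first-order preemphasis, s'[n] = s[n] - preemph * s[n-1]
    s = [signal[0]] + [signal[n] - preemph * signal[n - 1]
                       for n in range(1, len(signal))]
    # Step 2: N-point Hamming window
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (N - 1)) for n in range(N)]
    frames = []
    for start in range(0, len(s) - N + 1, M):
        x = [s[start + n] * window[n] for n in range(N)]
        # Step 3: autocorrelation for lags 0..p
        r = [sum(x[n] * x[n + k] for n in range(N - k)) for k in range(p + 1)]
        # Step 4: Levinson-Durbin recursion
        coeffs, _ = levinson_durbin(r, p)
        frames.append(coeffs)
    return frames

# Synthetic two-tone signal used only to exercise the code.
sig = [math.sin(0.3 * n) + 0.5 * math.sin(1.1 * n) for n in range(1000)]
frames = lpc_features(sig, N=300, M=100, p=8)
```

With 1000 samples, N = 300 and M = 100, the frames start at samples 0, 100, ..., 700, so eight feature vectors of eight coefficients each are produced.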

The LPC feature model has been a popular speech representation because of its ease of implementation, and
because  the  technique  provides  a  robust,  reliable,  and  accurate  method  for  characterizing  the  spectral  properties  of 
the  speech  signal. 

As seen from the above discussion, the output of the feature measurement procedure is basically a
time-frequency pattern, i.e. a vector of spectral features is obtained periodically in time throughout the speech.

Figure 3. LPC Analysis Model.

II.2 Pattern Training

Pattern training is the method by which representative test patterns are converted into reference patterns for
use  by  the  pattern  similarity  algorithm.  There  are  several  ways  in  which  pattern  training  can  be  performed, 
including: 

1. casual training in which each individual training pattern is used directly to create either a non-parametric
reference pattern or a statistical model. Casual training is the simplest, most direct method of creating
reference patterns.

2.  robust  training  in  which  several  (i.e.  two  or  more)  versions  of  each  vocabulary  entry  are  used  to  create  a 
single  reference  pattern  or  statistical  model.  Robust  training  gives  statistical  confidence  to  the  reference 
patterns  since  multiple  patterns  are  used  in  the  training. 

3. clustering training in which a large number of versions of each vocabulary entry are used to create one or
more  reference  patterns  or  statistical  models.  A  statistical  clustering  analysis  is  used  to  determine  which 
members  of  the  multiple  training  patterns  are  similar,  and  hence  are  used  to  create  a  single  reference 
pattern.  Clustering  training  is  generally  used  for  creating  speaker  independent  reference  patterns,  in  which 
case  the  multiple  training  patterns  of  each  vocabulary  entry  are  derived  from  a  large  number  of  different 
talkers. 

The final result of the pattern training algorithm is the set of reference patterns used in the recognition phase of
the model of Fig. 1.

II.3 Pattern Similarity Algorithm

A  key  step  in  the  recognition  algorithm  of  Fig.  1  is  the  determination  of  similarity  between  the  measured 
(unknown) test pattern, and each of the stored reference patterns. Because speaking rates vary greatly from
repetition to repetition, pattern similarity determination involves both time alignment (registration) of patterns, and
once  properly  aligned,  distance  computation  along  the  alignment  path. 

Figure 4 illustrates the problem involved in time aligning a test pattern, T(n), 1 ≤ n ≤ NT (where each T(n)
is a vector), and a reference pattern R(m), 1 ≤ m ≤ NR. Our goal is to find an alignment function, m = w(n),
which maps R onto the corresponding parts of T. The criterion for correspondence is that some measure of
distance between the patterns be minimized by the mapping w. Defining a local distance measure, d(n, m), as the
spectral distance between vectors T(n) and R(m), the task of the pattern similarity algorithm is to determine
the optimum mapping, w, to minimize the total distance

D^* = \min_{w(n)} \sum_{n=1}^{NT} d(n, w(n))    (1)

The solution to Eq. (1) can be obtained in an efficient manner using the techniques of dynamic programming. In
particular, a class of procedures called dynamic time warping (DTW) techniques has evolved for solving Eq. (1)
efficiently [6].
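
A minimal dynamic programming sketch of Eq. (1) is given below. It is illustrative only: the Euclidean local distance, the endpoint constraints w(1) = 1 and w(NT) = NR, and the local path constraint that w may advance by 0, 1 or 2 reference frames per test frame are assumptions chosen for brevity, not constraints specified in the text.

```python
import math

def local_distance(t, r):
    """Assumed local distance d(n, m): Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(t, r)))

def dtw_distance(test, ref):
    """Minimum total distance of Eq. (1) over monotonic alignments m = w(n)."""
    nt, nr = len(test), len(ref)
    inf = math.inf
    # prev[m] = best accumulated distance ending at reference frame m
    prev = [inf] * nr
    prev[0] = local_distance(test[0], ref[0])        # w(1) = 1
    for n in range(1, nt):
        cur = [inf] * nr
        for m in range(nr):
            best = prev[m]                            # w(n) - w(n-1) = 0
            if m >= 1:
                best = min(best, prev[m - 1])         # advance by 1
            if m >= 2:
                best = min(best, prev[m - 2])         # advance by 2
            if best < inf:
                cur[m] = best + local_distance(test[n], ref[m])
        prev = cur
    return prev[nr - 1]                               # w(NT) = NR

reference = [[0.0], [1.0], [2.0]]
slow_test = [[0.0], [0.0], [1.0], [2.0]]   # same pattern with the first frame repeated
fast_test = [[0.0], [2.0]]                  # same pattern with the middle frame skipped
d_slow = dtw_distance(slow_test, reference)
d_fast = dtw_distance(fast_test, reference)
d_diff = dtw_distance([[0.0], [1.0]], [[0.0], [3.0]])
```

The slower and faster versions of the reference both align with zero total distance, while the mismatched pair accumulates a distance of 2. If no path satisfies the constraints (for example when NR exceeds 2 NT - 1), the function returns infinity.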

The above discussion has shown how to time align a pair of templates. In the case of aligning statistical models,
an analogous procedure, based on the Viterbi algorithm, can be used [7,8].
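
For the statistical model case, the alignment can be sketched with a log-domain Viterbi search over a small discrete-observation hidden Markov model. The two-state left-to-right model, its probabilities and the function names below are hypothetical and serve only to illustrate the recursion; for simplicity the best path may end in any state.

```python
import math

def log_prob(p):
    """Natural logarithm that maps zero probability to minus infinity."""
    return math.log(p) if p > 0.0 else -math.inf

def viterbi(observations, init, trans, emit):
    """Return the most likely state sequence and its log probability."""
    n_states = len(init)
    delta = [log_prob(init[s]) + log_prob(emit[s][observations[0]])
             for s in range(n_states)]
    backpointers = []
    for obs in observations[1:]:
        new_delta, pointers = [], []
        for j in range(n_states):
            # best predecessor state for state j
            best_i = max(range(n_states),
                         key=lambda i: delta[i] + log_prob(trans[i][j]))
            new_delta.append(delta[best_i] + log_prob(trans[best_i][j])
                             + log_prob(emit[j][obs]))
            pointers.append(best_i)
        delta = new_delta
        backpointers.append(pointers)
    last = max(range(n_states), key=lambda s: delta[s])
    path = [last]
    for pointers in reversed(backpointers):   # trace the alignment backwards
        path.append(pointers[path[-1]])
    path.reverse()
    return path, delta[last]

# Hypothetical two-state left-to-right model with two observation symbols (0 and 1).
init = [1.0, 0.0]
trans = [[0.5, 0.5],
         [0.0, 1.0]]
emit = [[0.9, 0.1],
        [0.1, 0.9]]
path, score = viterbi([0, 0, 1, 1], init, trans, emit)
```

Here the state sequence plays the role of the warping function w(n): each observation frame is assigned to one model state, and the accumulated log probability replaces the accumulated distance.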

II.4 Decision Algorithm

The last step in the statistical pattern recognition model of Fig. 1 is the decision algorithm, which utilizes both
the set of pattern similarity scores (distances) and the system knowledge, in terms of syntax and/or semantics, to
decode the speech into the best possible transcription. The decision algorithm can (and generally does)
incorporate some form of nearest neighbor rule to process the distance scores to increase confidence in the results
provided by the pattern similarity procedure. The system syntax helps to choose among the candidates with the
lowest distance score by eliminating candidates which don't satisfy the syntactic constraints of the task, or by
deweighting extremely unlikely candidates. The decision algorithm can also have the capability of providing
multiple decodings of the spoken string. This feature is especially useful in cases in which multiple candidates
have nearly indistinguishable distance scores.

Figure 4. Example of Time Registration of a Test and Reference Pattern.

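One possible form of the decision step of Section II.4 is sketched below. It is a hedged illustration, not the rule used in the systems reported here: the K-nearest-neighbor averaging, the function name, the vocabulary and the distance values are all hypothetical.

```python
def rank_candidates(distances, allowed_words, k=2, n_best=3):
    """Rank vocabulary words by a K-nearest-neighbor rule, after a syntax filter.

    distances: dict mapping each word to the list of distances between the test
               pattern and that word's reference patterns (assumed non-empty).
    allowed_words: set of words permitted by the task syntax at this point.
    """
    scored = []
    for word, dists in distances.items():
        if word not in allowed_words:     # syntax eliminates this candidate
            continue
        nearest = sorted(dists)[:k]       # the k closest reference patterns
        scored.append((sum(nearest) / len(nearest), word))
    scored.sort()
    # Return up to n_best decodings, lowest average distance first.
    return [(word, score) for score, word in scored[:n_best]]

scores = {
    "five": [0.42, 0.47, 0.90],
    "nine": [0.40, 0.75, 0.80],
    "fine": [0.30, 0.35],
}
ranking = rank_candidates(scores, allowed_words={"five", "nine"})
```

In this toy example "fine" has the lowest distances but is removed by the syntax, and "five" is ranked ahead of "nine" because its two closest reference patterns are both near, even though the single closest pattern belongs to "nine". Returning a ranked list rather than one word corresponds to providing multiple decodings.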

II.5 Summary

We  have  now  outlined  the  basic  signal  processing  steps  in  the  pattern  recognition  approach  to  speech 
recognition.  In  the  next  sections  we  illustrate  how  this  model  has  been  applied  to  problems  in  isolated  word, 
connected  word,  and  continuous  speech  recognition. 

III.  Results  on  Isolated  Word  Recognition 

Using the pattern recognition model of Fig. 1, with an 8th order LPC parametric representation, and using the
non-parametric  template  approach  for  reference  patterns,  a  wide  variety  of  tests  of  the  recognizer  have  been 
performed  with  isolated  word  inputs  in  both  speaker  dependent  (SD)  and  speaker  independent  (SI)  modes. 
Vocabulary  sizes  have  ranged  from  as  small  as  10  words  (i.e.  the  digits  zero-nine)  to  as  many  as  1109  words. 
Table  I  gives  a  summary  of  recognizer  performance  under  the  conditions  discussed  above.  It  can  be  seen  that  the 
resulting  error  rates  are  not  strictly  a  function  of  vocabulary  size,  but  also  are  dependent  on  vocabulary 
complexity.  Thus  a  simple  vocabulary  of  200  polysyllabic  Japanese  city  names  had  a  2.7%  error  rate  (in  an  SD 
mode),  whereas  a  complex  vocabulary  of  39  alphadigit  terms  (in  both  SD  and  SI  modes)  had  error  rates  of  about 
5-8%. 

Table  I  also  shows  that  in  cases  where  the  same  vocabulary  was  used  in  both  SD  and  SI  modes  (e.g.  the 
alphadigits  and  the  airline  words),  the  recognizer  gave  comparable  performances.  This  result  indicates  that  the  SI 
mode  clustering  analysis,  which  yielded  the  set  of  SI  templates  or  models,  was  capable  of  providing  the  same 
degree  of  representation  of  each  vocabulary  word  as  either  casual  or  robust  training  for  the  SD  mode.  Of  course 
the  computation  of  the  SI  mode  recognizer  was  comparably  higher  than  that  required  for  the  SD  mode  since  a 
larger  number  of  templates  or  models  were  used  in  the  pattern  similarity  comparison. 


Vocabulary              Mode    Error Rate (%)

10 Digits               SI      0.8
37 Dialer Words         SD      0.0
39 Alphadigits          SD      4.5
                        SI      7.7
54 Computer Terms       SI      3.5
129 Airline Words       SD      1.0
                        SI      2.9
200 Japanese Cities     SD      2.7
1109 Basic English      SD      4.3

Table I

Performance of Template-Based
Isolated Word Systems


The results in Table I are based on using either word templates or statistical models created from isolated word
training tokens. Studies have shown that when adequate training data is available, the performance of isolated
word recognizers based on statistical models is comparable to or better than that of recognizers based on
templates. The main issue here is the amount of training data available relative to the number of parameters to be
estimated in the statistical model. For small amounts of training data, very unreliable parametric estimates result,
and the template approach is generally superior to the statistical model approach. For moderate amounts of
training data, the performance of both types of models is comparable. However, for large amounts of training
data, the performance of statistical models is generally superior to that of template approaches because of their
ability to accurately characterize the tails of the distribution (i.e. the outliers in terms of the templates).

IV. Connected Word Recognition Model

The basic approach to connected word recognition from discrete reference patterns is shown in Fig. 5.
Assume we are given a test pattern T, which represents an unknown spoken word string, and we are given a set of
V reference patterns, {R_1, R_2, ..., R_V}, each representing some word of the vocabulary. The connected word
recognition problem consists of finding the "super" reference pattern, R', of the form

R' = R_{v(1)} \oplus R_{v(2)} \oplus \cdots \oplus R_{v(L)}

which is the concatenation of L reference patterns, R_{v(1)}, R_{v(2)}, ..., R_{v(L)}, which best matches the test string, T,
in the sense that the overall distance between T and R' is minimum over all possible choices of L,
v(1), v(2), ..., v(L), where the distance is an appropriately chosen distance measure.

Figure  5.  Illustration  of  Connected  Word  Recognition  from  Word  Templates. 

There are several problems associated with solving the above connected word recognition problem. First we
don't know L, the number of words in the string. Hence our proposed solution must provide the best matches for
all reasonable values of L, e.g. L = 1, 2, ..., L_MAX. Second we don't know, nor can we reliably find, word
boundaries, even when we have postulated L, the number of words in the string. The implication is that the word
recognition algorithm must work without direct knowledge of word boundaries; in fact the estimated word
boundaries will be shown to be a byproduct of the matching procedure. The third problem with a template
matching procedure is that the word matches are generally much poorer at the boundaries than at frames within
the word. In general this is a weakness of word matching schemes which can be somewhat alleviated by
matching procedures which apply lesser weight to the match at template boundaries than at frames within the
word. A fourth problem is that word durations in the string are often greatly different (shorter) than the durations
of the corresponding reference patterns. To alleviate this problem one can use some time prenormalization
procedure to warp the word durations accordingly, or rely on reference patterns extracted from embedded word
strings. Finally, the last problem associated with matching word strings is that the combinatorics of matching
strings exhaustively (i.e. by trying all combinations of reference patterns in a sequential manner) is prohibitive.
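
The size of the exhaustive search is easy to quantify. The short sketch below counts the candidate strings for a vocabulary of V words and string lengths from 1 to a maximum length; the function name is hypothetical, and the example values (an 11-word digit vocabulary and strings of 1 to 7 digits) are taken from the digit task of Table II.

```python
def exhaustive_string_count(vocabulary_size, max_length):
    """Number of distinct word strings of length 1 .. max_length."""
    return sum(vocabulary_size ** length for length in range(1, max_length + 1))

# 11-word digit vocabulary, strings of 1 to 7 digits
count = exhaustive_string_count(11, 7)
```

Even for this small vocabulary more than 21 million super reference patterns would have to be time aligned with the test string, which is why exhaustive matching is ruled out.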

A number of different ways of solving the connected word recognition problem have been proposed which
avoid the plague of combinatorics mentioned above. Among these algorithms are the 2-level DP approach of
Sakoe [11], the level building approach of Myers and Rabiner [12], the parallel single stage approach of Bridle et
al. [13], and the nonuniform sampling approach of Gauvain and Mariani [14]. Although each of these approaches
differs greatly in implementation, all of them are similar in that the basic procedure for finding R' is to solve a
time-alignment problem between T and R' using dynamic time warping (DTW) methods.


The level building DTW based approach to connected word recognition is illustrated in Fig. 6. Shown in this
figure are the warping paths for all possible length matches to the test pattern, along with the implicit word
boundary markers (e_1, e_2, ..., e_{L-1}, e_L) for the dynamic path of the L-word match. The level building
algorithm has the property that it builds up all possible L-word matches one level (word in the string) at a time.
For each string match found, a segmentation of the test string into appropriate matching regions for each reference
word in R' is obtained. In addition, for every string length L, the best Q matches (i.e. the Q lowest distance
L-word strings) can be found. The details of the level building algorithm are available elsewhere [12], and will not
be discussed here.


Figure  6.  Sequence  of  DTW  Warps  to  Provide  Best  Word  Sequences  of  Several  Different  Lengths. 

Typical  performance  results  for  connected  word  recognizers,  based  on  a  level  building  implementation,  are 
shown in Table II. For a digits vocabulary, string accuracies of 98-99% have been obtained. For name retrieval,
by spelling, from a 17,000 name directory, string accuracies of from 90% to 96% have been obtained. Finally,
using a moderate size vocabulary of 127 words, the accuracy of sentences for obtaining information about airline
schedules  is  between  95%  and  99%.  Here  the  average  sentence  length  was  close  to  10  words.  Many  of  the  errors 
occurred  in  sentences  with  long  strings  of  digits. 

V.  Continuous,  Large  Vocabulary,  Speech  Recognition 

The  area  of  continuous,  large  vocabulary,  speech  recognition  refers  to  systems  with  at  least  1000  words  in  the 
vocabulary,  a  syntax  approaching  that  of  natural  English  (i.e.  an  average  branching  factor  on  the  order  of  100), 
and  possibly  a  semantic  model  based  on  a  given,  well  defined,  task.  Figure  7  shows  a  block  diagram  of  a 
continuous  speech,  large  vocabulary  recognition  system.  For  this  system,  there  are  three  distinct  issues  that  must 
be  resolved,  namely  choice  of  a  basic  recognition  unit  (and  a  modelling  technique  to  go  with  it),  a  method  of 
mapping recognized units into words (or, more precisely, a method of creating word models from individual
sub-word units), and a way of representing the formal syntax of the recognition task (or, more precisely, a way of
integrating  the  syntax  directly  into  the  recognition  algorithm). 




Vocabulary                Mode                   Word Accuracy     Task                          String (Task) Accuracy

Digits (11 words)         Speaker Dependent or   > 99% (SI)        1-7 digit strings             98.3% (SI)*
                          Speaker Independent    > 99% (SD)        1-7 digit strings             99% (SD)*

Letters of the Alphabet   Speaker Dependent or   ~90% (SD or SI)   Directory listing retrieval   96% (SD)
(26 words)                Speaker Independent                      (17,000 name directory)       90% (SI)

Airline Terms             Speaker Dependent or   > 99% (SD)        Airline information and       99% (SD)
(129 words)               Speaker Independent    99% (SI)          reservations                  93% (SI)

* Known string length.

Table II

Performance of Connected Word Recognizers on
Specific Recognition Tasks


Figure 7. Block Diagram of System for Large Vocabulary Recognition.

For each of the three parts of the continuous speech recognition problem, there are several alternative
approaches. For the basic recognition unit, one could consider whole words, half syllables such as dyads,
demisyllables, or diphones, or sound units as small as phonemes or phones. Whole word units, which are
attractive because of our knowledge of how to handle them in connected environments, are totally impractical to
train since each word could appear in a broad variety of contexts. Therefore the amount of training required to
capture all the types of word environments is unrealistic. For the sub-word units, the required training is
extensive, but can be carried out using a variety of well known, existing training procedures. A full system
typically requires between 1000 and 2000 half syllable speech units. For the phoneme-like units, only about
10-100 units need to be trained.

The problem of representing vocabulary words, in terms of the chosen speech unit, has several possible
solutions. One could create a network of linked word unit models for each vocabulary word. The network could
be either a deterministic (fixed) or a stochastic structure. An alternative is to do lexical access from a dictionary
in which all word pronunciation variants (and possibly part of speech information) are stored, along with a
mapping from pronunciation units to speech representation units.

Finally the problem of representing the task syntax, and integrating it into the recognizer, has several
solutions. The task syntax, or grammar, can be represented as a deterministic state diagram, as a stochastic model
(e.g. a model of word tri-gram statistics), or as a formal grammar. There are advantages and disadvantages to
each of these approaches.

We illustrate the state of the art in large vocabulary speech recognition with two examples, one based on
phoneme-like sub-word units with a single entry per word dictionary and a task grammar with an average
perplexity (word branching factor) of 60, the other based on statistically defined sub-word units, a statistical
word model and a statistical language model with an average perplexity of about 100. The former system has
been applied to the task of ship management [15,16]; the latter system has been applied to the task of automatic
transcription of office dictation [17].


V.1 Continuous Speech, Speaker Independent, Large Vocabulary Recognition

For this system the basic recognition unit is a set of 47 context independent phone-like units (PLU's) where
each PLU is based on the traditional phonetic definition of a phoneme. Each sub-word unit is represented by a
left-to-right 3-state hidden Markov model. Words are represented as sequences of the basic PLU's as determined
by a standard phonetic pronunciation dictionary; only a single pronunciation is used for each word. Training of
the units is accomplished via standard connected word training algorithms, based on the set of sub-word units.
The recognition task, which was the DARPA Resource Management task, is represented as a finite state grammar
with 991 word arcs, 4 silence arcs, and 16 null arcs (when no output symbol is emitted), as shown in Fig. 8. The
grammar also has a word-pair list which specifies which set of words can follow each of the 991 words in the
grammar. The average word branching factor is 60.
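
As an illustration of what an average word branching factor measures, the sketch below computes it for a word-pair list. It assumes the factor is the unweighted mean number of permitted followers per word, which the text does not state explicitly; the function name and the toy word-pair list are hypothetical and are not taken from the DARPA grammar.

```python
def average_branching_factor(word_pairs):
    """Mean number of words allowed to follow each word in a word-pair grammar."""
    total_followers = sum(len(followers) for followers in word_pairs.values())
    return total_followers / len(word_pairs)

# Hypothetical toy word-pair list: each word maps to the set of words that may follow it.
toy_pairs = {
    "show": {"all", "the"},
    "all": {"ships"},
    "the": {"ships", "ship"},
    "ships": {"show"},
    "ship": {"show"},
}
branching = average_branching_factor(toy_pairs)
```

For the toy list the seven permitted word pairs spread over five words give a branching factor of 1.4; the same computation over the 991-word grammar would be expected to give a value on the order of the 60 quoted above.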


Figure 8. Finite State Network Representation of the DARPA Task Syntax.

Both context independent (CI) and context dependent (CD) units were used, with appropriate modifications to
the word dictionary for the CD units. Recognition performance, in terms of word accuracy on 2 independent test
sets and on a subset of the training set, is shown in Table III for 3 sets of units.


Unit Set              150 Sentence    300 Sentence    160 Sentence
                      Test Set        Test Set        Training Set

47 CI PLU             89.9            86.0            93.3
638 CD PLU            93.3            91.9            98.7
1076 CD PLU (CMU)     93.7            93.9            -

Table III

Performance of Large Vocabulary, Speaker Independent,
Continuous Speech Recognition System on 991 Word DARPA Task
(Results are word accuracies in %, i.e. % correct - % insertions)

It can be seen from Table III that word recognition accuracies on the order of 93-94% can be achieved on this
task.
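
The accuracy measure quoted in the caption of Table III (% correct minus % insertions) can be written out as a short calculation. The sketch assumes that both percentages are taken relative to the number of words in the reference transcription; the function name and the counts in the example are hypothetical.

```python
def word_accuracy(n_reference_words, n_correct, n_insertions):
    """Word accuracy in percent: % correct minus % insertions."""
    pct_correct = 100.0 * n_correct / n_reference_words
    pct_insertions = 100.0 * n_insertions / n_reference_words
    return pct_correct - pct_insertions

# Hypothetical counts: 1000 reference words, 950 recognized correctly, 12 inserted words.
acc = word_accuracy(n_reference_words=1000, n_correct=950, n_insertions=12)
```

With these counts the recognizer is 95.0% correct but is charged 1.2% for insertions, giving a word accuracy of 93.8%.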

V.2 Isolated Speech, Speaker Trained, Very Large Vocabulary Recognition

This system uses phoneme-like units in a statistical model to represent words, where each phoneme-like unit is
a statistical model based on vector-quantized spectral outputs of a speech spectrum analysis. A third statistical
model is used to represent syntax; thus the recognition task is essentially a Bayesian optimization over a triply
embedded sequence of statistical models. The computational requirements are very large, but a system has been
implemented using isolated word inputs for the task of automatic transcription of office dictation. For a
vocabulary of 3000 words, in a speaker trained mode, with 20 minutes of training for each talker, the average
word error rates for 3 talkers are 2% for prerecorded speech, 3.1% for read speech, and 3.7% for spontaneously
spoken speech [17].

VI.  Summary 

In  this  paper  we  have  reviewed  and  discussed  the  general  pattern  recognition  framework  for  machine 
recognition  of  speech.  We  have  discussed  some  of  the  signal  processing  and  statistical  pattern  recognition  aspects 
of  the  model  and  shown  how  they  contribute  to  the  recognition. 

The  challenges  in  speech  recognition  are  many.  As  illustrated  above,  the  performance  of  current  systems  is 
barely  acceptable  for  large  vocabulary  systems,  even  with  isolated  word  inputs,  speaker  training,  and  favorable 
talking  environment.  Almost  every  aspect  of  continuous  speech  recognition,  from  training  to  systems 
implementation, represents a challenge in performance, reliability, and robustness.

SUMMARY

This  lecture  is  intended  to  give  an  overview  of  assessment  methods  for  speech 
communication  systems,  speech  synthesis  systems  and  speech  recognition  systems.  The 
first  two  systems  require  an  evaluation  in  terms  of  intelligibility  measures.  Several 
subjective  and  objective  measures  will  be  discussed. 

Evaluation  of  speech  recognizers  requires  a  different  approach  as  the  recognition 
rate  normally  depends  on  recognizer-specific  parameters  and  external  factors.  Some 
results  of  the  assessment  methods  for  recognition  systems  will  be  discussed. 

Case  studies  are  given  for  each  group  of  systems. 


1  INTRODUCTION 

Assessment  methods  for  speech  processing  systems  can  be  divided  into  three  groups: 

-  subjective  and  objective  intelligibility  measures  for  speech  transmission  and  coding 
systems  (human-to-human) 

-  subjective  and  objective  quality  measures  for  speech  output  systems  (machine-to- 
human) 

- (predictive) assessment methods for automatic speech recognition systems (human-to-
machine).

Several  methods  are  used  for  the  subjective  evaluation  of  speech  transmission 
systems.  The  difference  between  the  methods  concerns  mainly  the  type  of  speech  material 
used  for  the  test  and  the  response  method.  Frequently  used  methods  are  based  upon 
segmental  evaluation,  suprasegmental  evaluation  or  overall  quality  measures.  This  is 
covered  by  phonemes,  words  and  sentences  respectively. 

Also  objective  methods  in  which  the  transmission  quality  is  quantified  by  physical 
parameters  are  used.  The  relation  between  these  methods  and  their  specific  aspects  will 
be  discussed. 

For  speech  output  systems  some  additional  aspects  may  be  involved  such  as 
intonation.  The  speech  signal  can  be  composed  of  individual  speech  tokens  like  phonemes, 
diphones,  or  larger  portions  which  may  result  in  distortions  not  usual  for  transmission 
channels.  Some  tests,  like  quality  ratings,  will  be  discussed. 

Speech  input  systems  are  normally  evaluated  in  relation  to  a  certain  application. 
This is done with a custom-tailored speech data-base or in a field experiment. However,
more generally applicable evaluation methods such as predictive methods are also becoming
available. 

For  all  three  groups  the  military  application  requires  a  careful  evaluation  of  the 
environmental  conditions  such  as  high  noise  levels,  g,  stress,  mask  microphones  etc.  The 
general  approach  of  including  these  conditions  into  the  test  method  will  be  discussed. 

In some countries such as the UK, France and The Netherlands, joint national
research programs have been started to coordinate research efforts. A European research
project (sponsored by ESPRIT) was started in 1988. Within this ESPRIT SAM project
(multilingual speech input/output assessment, methodology and standardization) seven
countries work together on the development and evaluation of speech input/output
assessment methods. In the USA an advanced research program is proceeding on the
development and application of speech input/output systems in military conditions. In
NATO a research study group (AC/243 (Panel 3)/RSG-10) is involved with the application of
speech input/output systems in the multilingual military environment.


2.1  SUBJECTIVE  AND  OBJECTIVE  INTELLIGIBILITY  MEASURES  FOR  SPEECH  TRANSMISSION  AND 
CODING  SYSTEMS  (HUMAN-TO-HUMAN) 

A  number  of  subjective  tests  have  been  developed  during  the  forties,  and  are 
extensively  used  for  the  evaluation  of  speech  communication  channels.  There  are  also  two 
objective  test-methods  available.  These  tests  are  based  on  the  generation  and  analysis 
of a special speech-like test signal.

We  can  classify  the  intelligibility  tests  with  respect  to  their  use:  items  tested, 
diagnostic  information,  minimum  number  of  subjects  required  for  reliable  results, 
training  and  measuring  time.  Another  aspect  is  the  application:  are  we  comparing  and 
rank-ordering  systems,  are  we  evaluating  a  system  for  a  specific  application  or  are  we 
supporting  the  development  of  a  system? 

When we restrict ourselves to the subjective tests, a general classification can be
made according to the items tested and the manner of response. The lowest level (segmental
evaluation, i.e. phonemes) is covered by the rhyme tests and the open response word
tests.

A rhyme test is a multiple choice test where a listener has to select the auditorily
presented word from a small group of visually presented possible responses. In general
only the initial consonants of the response words are changed, such as Ban, Dan, Tan,
Kan, Pan. Frequently used rhyme tests are the Diagnostic Rhyme Test (DRT) and the
Modified Rhyme Test (MRT).

The DRT is based on two forced choice alternatives [29], [16], [21] while the MRT is
based on six alternatives [6]. As the response set is limited, a listener's response may
not coincide with what is actually heard by the listener. Recent studies have shown that
results obtained with a DRT may over-estimate speech intelligibility and distort the
perceptual space and therefore the diagnostic value of the results [3], [24]. A more
general approach is obtained with an open response, as with word tests.

Word-tests are based on short nonsense or meaningful words of the CVC-type
(consonant-vowel-consonant). The test words are presented in isolation or in a carrier
phrase. The listener can respond with any CVC combination he has heard. Hence all
confusions between the phonemes are possible. The test results include the phoneme
score, the word score and the confusions between the initial consonants, vowels and
final consonants. The confusion matrices present useful information to improve the
performance of a system [32].
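
The scoring of an open response CVC test can be sketched as follows. This is an illustrative sketch only: the function name, the representation of each word as a (initial consonant, vowel, final consonant) tuple, and the example responses are hypothetical.

```python
from collections import Counter

def score_cvc(pairs):
    """Score an open response CVC test.

    pairs: list of (presented, response) tuples, where each word is a 3-tuple
           (initial consonant, vowel, final consonant).
    Returns the phoneme score and word score in percent, plus one confusion
    count table per phoneme position.
    """
    confusions = [Counter(), Counter(), Counter()]
    phonemes_right = 0
    words_right = 0
    for presented, response in pairs:
        for pos in range(3):
            # count (presented phoneme, responded phoneme) for this position
            confusions[pos][(presented[pos], response[pos])] += 1
            phonemes_right += presented[pos] == response[pos]
        words_right += presented == response
    n = len(pairs)
    return 100.0 * phonemes_right / (3 * n), 100.0 * words_right / n, confusions

trials = [
    (("b", "a", "t"), ("b", "a", "t")),
    (("p", "i", "n"), ("b", "i", "n")),   # initial consonant confusion p -> b
    (("s", "u", "m"), ("f", "u", "n")),   # initial and final consonant confusions
    (("k", "o", "l"), ("k", "o", "l")),
]
phoneme_score, word_score, confusions = score_cvc(trials)
```

In the example 9 of the 12 phonemes and 2 of the 4 words are correct, so the phoneme score (75%) is higher than the word score (50%); the three count tables correspond to the confusion matrices for initial consonants, vowels and final consonants.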

Quality rating is a more general method used to evaluate the user's acceptance of a
transmission channel or speech output system. The claim of some investigators [Goodman
and Nash, 4] is that a quality rating includes the total auditory impression of speech
on a listener and can be used to discriminate between good and excellent quality. For
quality ratings normal test sentences or a free conversation is used to obtain the
listener's impression. The listener is asked to rate his impression on a subjective
scale like the five-point scale: bad, poor, fair, good and excellent. Different types of
scales are used such as: intelligibility, quality, acceptability, naturalness etc.

The speech reception threshold (SRT) measures the word or sentence intelligibility
against a level of masking noise. The listener has to recall a presented sentence which
was masked by noise. After a correct response the noise level is increased, while after
a false response the noise level is decreased. This procedure leads to an estimation of
the noise level where a 50% correct recall of the presented sentences is obtained [17].
The quality of the speech is related to the amount of noise which is necessary for the
masking. The procedure has the advantage that it can be performed with naive listeners.
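
The up-down rule described above can be illustrated with a simulated listener. The sketch
below is an assumption-laden toy: the logistic psychometric function, its slope, the step
size and the speech level are invented for the illustration and are not taken from [17].

```python
import math
import random

def simulate_srt(true_srt_db, n_trials=400, step_db=2.0, start_noise_db=40.0,
                 speech_db=60.0, slope=0.5, seed=1):
    """Simple 1-up/1-down track on the noise level, run against a simulated listener."""
    rng = random.Random(seed)
    noise_db = start_noise_db
    track = []
    for _ in range(n_trials):
        snr = speech_db - noise_db
        # Assumed psychometric function: 50% correct recall at snr == true_srt_db
        p_correct = 1.0 / (1.0 + math.exp(-slope * (snr - true_srt_db)))
        correct = rng.random() < p_correct
        track.append(noise_db)
        # Correct recall -> more noise; false recall -> less noise
        noise_db += step_db if correct else -step_db
    settled = track[n_trials // 4:]           # drop the initial approach phase
    mean_noise = sum(settled) / len(settled)
    return speech_db - mean_noise             # estimated SNR at the 50% point

estimate = simulate_srt(true_srt_db=-5.0)
print(round(estimate, 1))
```

Because the noise goes up after every correct response and down after every false one,
the track hovers around the level at which both outcomes are equally likely, which is
the 50% point.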

A recent new development is the use of anomalous sentences. These syntactically
correct but semantically anomalous sentences consist of approximately seven words. The
words are common mono-syllabic words with which an unlimited number of sentences can be
generated randomly. These sentences are constructed according to some predefined
grammatical structures. This test will be evaluated by the ESPRIT SAM project.


Fig. 1 Relation between signal-to-noise ratio (SNR) and some
intelligibility measures.




Fig. 1 gives for five intelligibility measures the score as a function of the
signal-to-noise ratio of speech combined with noise [30]. This gives an impression of
the effective range of each test. The given relation between intelligibility and the
signal-to-noise ratio is only valid for noise with a frequency spectrum equal to the
long-term speech spectrum. This is for instance the case with voice babble. A
signal-to-noise ratio of 0 dB means that the speech and the noise have equal energy.

As can be seen from the figure the nonsense CVC-words discriminate over a wide range
while meaningful test words have a slightly smaller range [1]. The digits and the spell
alphabet give a saturation at a SNR of -5 dB. This is due to: (a) the limited number of
test words and (b) recognition of these words is mainly controlled by the vowels rather
than the consonants. Vowels have an average level approximately 18 dB above the average
level of consonants, and are therefore more resistant against noise. On the other hand
non-linear distortion such as clipping will have a greater impact on the vowels than on
the consonants. Therefore the use of the spell alphabet, where the recognition is mainly
based on vowels, may lead to misleading results.

A well balanced test, as has been found in our study, is the CVC-word test based on
nonsense words and with the test words embedded in a carrier phrase. A carrier phrase
(which is in many studies neglected) will cause echoes and reverberation in conditions
with a distortion in the time domain. Also AGC settling will be established by the
carrier phrase, and pronouncing the extra words stabilises the vocal effort of the
talker.

The use of nonsense words increases the open response design of such a test and
extends the range of the test in order to discriminate at higher qualities, see Fig. 1.

The reproducibility of a test strongly depends on the number of talkers and
listeners used for the experiments. In general, for CVC-tests, [...] talkers and [...]
listeners are used. It has been found that the amount of variation among individual
results is equal for talkers and listeners, so in a balanced experiment these numbers
should be equal.

The test-retest reproducibility can be given by an index (Cronbach α). This is shown
in Fig. 2 where the α-index is given as a function of the number of talker-listener
pairs for some of the intelligibility tests as discussed above [34].
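
For reference, the usual textbook form of Cronbach's α can be computed as below. How the
cited study arranged its data is not stated in the text; treating each talker-listener
pair as an "item" and each test condition as a row is an assumption of this sketch, and
the scores are invented.

```python
def cronbach_alpha(scores):
    """scores[i][j]: score of test condition i obtained with talker-listener pair j."""
    n_items = len(scores[0])
    if n_items < 2 or len(scores) < 2:
        raise ValueError("need at least two conditions and two pairs")

    def variance(values):                     # unbiased sample variance
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    item_vars = [variance([row[j] for row in scores]) for j in range(n_items)]
    total_var = variance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1.0 - sum(item_vars) / total_var)

# Hypothetical scores: three conditions, three talker-listener pairs
consistent = [[10, 12, 11], [20, 22, 21], [30, 32, 31]]
noisy = [[10, 14, 9], [20, 19, 24], [30, 33, 28]]
print(cronbach_alpha(consistent), cronbach_alpha(noisy))
```

Pairs that rank the conditions identically give α = 1; disagreement between pairs pulls
the index down.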


Fig. 2 Test-retest α-index as a function of the number of talker-
listener pairs for some intelligibility measures.


The effort required and the poor information on the type of degradation of the
channel by the subjective methods have led to the development of objective measuring
techniques. French and Steinberg [2] published a method for predicting the speech
intelligibility of a transmission channel from its physical parameters. By using this
method a relevant index (Articulation Index, AI) was obtained. The method was
reconsidered by Kryter [30] who greatly increased its accessibility by the introduction
of a calculation scheme, work sheets, and tables. The AI is based on: (a) the
calculation of the effective signal-to-noise ratio within a number of frequency bands,
(b) the contribution of masking, (c) a linear transformation of the effective signal-to-
noise ratio to an octave-band-specific contribution from one to zero, and (d) the
calculation of a weighted mean of the contributions of all octave bands considered. This
method works with a calculation scheme and accounts for distortion in the frequency
domain such as band-pass limiting and noise. It is not applicable for distortion in the
time domain and for nonlinear distortion.
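
Steps (a), (c) and (d) can be sketched in a few lines. This is a simplified illustration
only: the masking correction of step (b) is left out, the 30 dB range from -12 dB to
+18 dB is an assumed mapping, and the equal band weights are placeholders rather than the
published tables.

```python
def articulation_index(band_snr_db, weights=None, lo_db=-12.0, hi_db=18.0):
    """Weighted mean of per-band contributions, each mapped linearly to 0..1."""
    if weights is None:
        weights = [1.0 / len(band_snr_db)] * len(band_snr_db)   # placeholder weights
    if len(weights) != len(band_snr_db):
        raise ValueError("one weight per band is required")
    total = sum(weights)
    ai = 0.0
    for snr, w in zip(band_snr_db, weights):
        clipped = min(max(snr, lo_db), hi_db)             # limit the effective SNR
        contribution = (clipped - lo_db) / (hi_db - lo_db)  # linear map to 0..1
        ai += w * contribution
    return ai / total

print(articulation_index([3.0, 3.0, 3.0]))
print(articulation_index([25.0, -20.0]))
```

A band whose effective signal-to-noise ratio lies above the upper limit contributes fully,
a band below the lower limit contributes nothing.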

A method developed by Steeneken and Houtgast [20] is based on the assumption that
transmission quality is closely related to the capacity of a channel to reproduce the
original sound spectrum. This can be expressed by the signal-to-noise ratio in a number
of relevant frequency bands, similar to the AI approach, but the method used in the
measurements determines this signal-to-noise ratio dynamically in such a way that
distortions in the frequency domain (non-linearities) and distortions in the time domain
(echoes, reverberation, AGC) are accounted for correctly. The result is expressed with
one single index, the Speech Transmission Index (STI).

On the basis of this method, a measuring device has been developed for determining the
quality of speech communication systems. It comprises two parts:

1) a signal source which replaces the talker, producing an artificial speech-like
test signal, and

2) an analysis part which replaces the listener, by which the signal at the receiving
end of the system under test is evaluated.

The STI measuring equipment, which originally used special hardware, will be programmed
in a digital signal processor system.

A careful design of the characteristics of the test signal and of the type of signal
analysis makes the present approach widely applicable. It has been verified
experimentally that a given STI implies a given effect on speech intelligibility,
irrespective of the nature of the actual disturbances (noise interference, band-pass
limiting, peak clipping, reverberation, etc.). The evaluation of the STI-method was
performed for Dutch CVC nonsense words. Anderson and Kalb [1] found similar results for
English.
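
The text does not give the STI arithmetic. In the commonly published form of the method
each measured modulation index m is converted to an apparent signal-to-noise ratio,
limited to ±15 dB and mapped to a value between 0 and 1. The sketch below follows that
form; the equal octave-band weights and the function names are assumptions made here.

```python
import math

def transmission_index(m):
    """Map one modulation index (0..1) to a transmission index (0..1)."""
    m = min(max(m, 1e-6), 1.0 - 1e-6)           # keep the logarithm finite
    snr = 10.0 * math.log10(m / (1.0 - m))      # apparent signal-to-noise ratio, dB
    snr = min(max(snr, -15.0), 15.0)            # limit to the assumed 30 dB range
    return (snr + 15.0) / 30.0

def sti(mtf, band_weights=None):
    """mtf[k][f]: modulation index in octave band k at modulation frequency f."""
    n_bands = len(mtf)
    if band_weights is None:
        band_weights = [1.0 / n_bands] * n_bands   # placeholder weights
    if len(band_weights) != n_bands:
        raise ValueError("one weight per octave band is required")
    band_ti = [sum(transmission_index(m) for m in row) / len(row) for row in mtf]
    return sum(w * ti for w, ti in zip(band_weights, band_ti)) / sum(band_weights)

print(sti([[0.5, 0.5], [0.5, 0.5]]))
```

Because any disturbance that reduces the modulation depth of the test signal lowers m,
noise, reverberation and clipping all end up on the same scale.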

In Fig. 1 the qualification and the relation between the STI, a signal-to-noise
ratio for speech-like noise and some subjective measures is given. The qualification was
obtained from an international experiment with eleven different laboratories [7,9].


2.2 APPLICATION EXAMPLES

In this chapter we will give three examples of the evaluation of a transmission
system: one example based on a subjective evaluation for narrow band secure voice
systems and two examples with the objective STI, on a CVSD-based radio link and on the
performance of a boom-microphone for use in a helicopter.

-  Narrowband  secure  voice  terminal 

A narrow band voice terminal is usually based on a vocoder. This means that the
speech signal is analysed at the transmission side in such a way that a significant
data-reduction is achieved. A useful method is to determine the frequency spectrum and
the fundamental frequency at the transmission side, for instance every 20 ms, and use
this information for resynthesis of the signal at the receiving side. Hence no waveform
is transmitted but reduced information of the speech signal. In this case errors can
occur concerning the spectral reproduction, the voiced/unvoiced decision, and the
fundamental frequency estimation. The latter two distortions exclude the use of the
existing objective measures and a subjective method has to be used. Up till now, a
frequently used method for the evaluation is the diagnostic rhyme test (DRT). A more
adequate method is the CVC-test with an open response scoring. In Table I the results
for two LPC systems and a reference channel are given according to Greenspan et al. [9].
It is obvious that the rank-order between the systems based on the DRT results differs
from the initial consonant results and the subjective opinion scores. Greenspan showed
that this could be explained by the restrictions of the DRT-concept.

Table I DRT score, initial consonant scores and subjective judgement of one
reference channel and two LPC based coders [9] (mean = m, standard error = s.e.).

Coder                    DRT-score         Ci-score          subj. judgement
                         m       s.e.      m       s.e.      m       s.e.

Filtered, but uncoded    99,7%   0,77      93,8%   0,33      69,0%   3,3
Coder A                  94,3%   0,62      78,6%   0,98      94,2%   1,5
Coder B                  93,1%   0,64      61,6%   0,94      53,9%   1,[...]

- CVSD secure voice radio link

Continuous Variable Slope Deltamodulation (CVSD) is a waveform based coding. For
this reason the objective STI-method can be used to determine the transmission quality.
As the method gives a measuring result every 19 s, the transmission quality can be
obtained as a function of the distance between the two ends of an air/ground
communication link. In the airplane the prerecorded STI test signal was connected to the
CVSD transmission system. At the ground station a real-time analysis of the decoded
signal was performed and the STI was obtained as a function of the distance between
airplane and ground station. This measurement was performed for three types of
modulation of the transmitter (base-band, diphase and an analog reference channel) as
indicated in Fig. 3. When we use a criterion of an STI of 0.39 as the lower limit for a
communication channel, the maximum communication distance for these conditions can be
obtained from the graph (23 naut. mile, 33 naut. mile, and 37 naut. mile respectively).
The flight level was 300 ft.




Fig. 3 Example of the STI as a function of the range for a secured CVSD
radio link and an analog link between an airplane and a ground station at
a flight level of 300 ft.


- Microphone performance in a noise environment

Gradient microphones are developed for use in a high noise environment. The
specifications, given by the manufacturers, normally describe the effect of the noise
reduction in general terms and not related to intelligibility, microphone position or
type of background noise. In Fig. 4 the transmission quality, expressed by the STI, for
two types of microphones is given as a function of the environmental noise level. For
these measurements an artificial head was used to obtain the test signal acoustically.
The microphone was placed on this artificial head at a representative distance from the
mouth. The test signal level was adjusted according to the nominal speech level (but can
be increased to simulate the Lombard effect). The head was placed in a diffuse sound
field with an adjustable level. From the figure we can see that the distance from the
mouth is an important parameter and that the two noise cancelling microphones have a
different performance for the noise as used in this experiment.


Fig. 4 STI as a function of the noise level for two different
microphones and two speaking distances.


3.1 SUBJECTIVE AND OBJECTIVE QUALITY MEASURES FOR SPEECH OUTPUT SYSTEMS (MACHINE-TO-HUMAN)


Speech synthesis has been available for more than twenty years. Starting with
simple systems which are able to reproduce short prerecorded speech tokens, the field
has developed to systems converting text-to-speech.

In general, speech output systems are based on waveform coding, storage and
reproduction. More advanced systems are based on the coding of specific speech
parameters such as spectral shape, fundamental frequency, etc. The latter method results
in a more efficient coding but has, in general, a lower speech quality.

Up till now efficient coding leads to a lower speech quality and to more
flexibility. A text-to-speech system can be based on elementary speech components like
phonemes or diphones. Such a storage or a description of these elementary speech
components opens the possibility of reproduction in any desired order. However, to
obtain intelligible speech with an acceptable quality some other aspects have to be
taken into account. For instance, the word stress, sentence accent, and the intonation
contours have a major effect on the acceptability. Evaluation of speech output systems
is therefore required to obtain performance figures and to obtain more diagnostic
information for the improvement of the investigated systems.

The assessment of a speech output system depends on the type of system which is
involved. For a waveform coder, based on real speech tokens, a segmental intelligibility
test on phoneme or word level will be satisfactory. Other acceptability items do not
depend on the system itself and are defined by the speech tokens pronounced by the
talker. However, an allophone or diphone based system also affects the intonation
contours, and besides a segmental intelligibility test a supra-segmental test, up to a
sentence level, is required.

There are only few experimental results on intonation. Terken and Collier [28] gave
a comparison of natural speech and two synthetic intonation algorithms.


3.2  APPLICATION  EXAMPLES 

Recently many research results became available for synthetic speech systems on a
segmental level [18], [19]. We will give here an example from the speech research
laboratory of Indiana University. It is more difficult to obtain reported results on
suprasegmental level and on overall quality.

- Segmental intelligibility of nine text-to-speech systems

Nine text-to-speech systems were evaluated by using the Modified Rhyme Test (MRT) in two
applications: with a closed response format and with an open response format [11]. In
Table II the error rates for the nine systems and for natural speech are given. The
error rates are the mean of the errors for initial and final consonants.

Table II MRT overall error rates (%) for consonants in initial and final position [11].

System                   open MRT    closed MRT

natural speech             2.78         0.53
DECtalk 1.8 Paul          12.92         3.25
DECtalk 1.8 Betty         17.50         5.72
MITalk 79                 24.56         7.00
Prose 3.0                 19.42         5.72
Amiga                     42.89        12.25
Infovox SA 101            37.14        12.50
Smoothtalker              56.89        27.22
Votrax Type'n'Talk        68.47        27.44
Echo                      73.97        35.36


4.1 ASSESSMENT METHODS FOR AUTOMATIC SPEECH RECOGNITION SYSTEMS (HUMAN-TO-MACHINE)

For an optimal choice of a recogniser different aspects, directly related to the
application, play an important role. Some of these aspects are:

type of recognition - isolated words, connected words, connected discourse
vocabulary          - type of speech to be recognized
training            - type of templates, number of templates, automatic adjustment
operation           - signal level dependency, sensitivity to temporal variation,
                      speaker dependency, noise sensitivity, acquisition time
realisation         - size, weight, price, power consumption, interfacing.


For an optimal choice a design of the application in a human-system structure is
needed [13]. The following aspects have to be considered:

- microphone position
- environmental noise
- speaking rate
- vocabulary size
- vocabulary selection
- syntax rules
- acceptable error rate
- feedback after recognition
- error correction
- stability of reference patterns
- number of users
- user related learning effects.

A variety of systems is available; however, optimal selection is not an easy
job. The NATO research study group on speech processing has established a list of
commercially available systems on speech recognition and speech synthesis [12].

Assessment of a speech recogniser can be performed either in a field experiment
under realistic conditions or in the laboratory under artificial conditions. Both
methods have their advantages and drawbacks such as:

field evaluation           laboratory evaluation

representative             artificial
uncontrolled conditions    reproducible conditions
expensive                  inexpensive.

In order to gain the advantages of both methods, data-bases for representative
conditions can be established (recording of representative speech tokens) and be used
many times in the laboratory. Both methods however have no predictive power to other
applications.

External factors may influence the recognition results, therefore the evaluation of
a recognition system must be performed under controlled and specified conditions. These
factors can be divided into some main groups. Each group can be divided into specific,
individual factors. These groups and factors are:

- Speech         isolated words
                 connected words
                 connected discourse
- Speaker        speaker dependency
                 speaker within/outside reference patterns
                 age, sex, accent, native language
                 recording conditions
                 vocal effort
                 speaking rate
                 language dependency
- Task           size, redundancy of vocabulary
                 complexity of syntax
- Environment    noise
                 reverberation
                 co-channel interference
- Input          microphone
                 system noise
                 distortion
- Recognizer     system parameters, thresholds
                 training.


As a function of all these parameters one can determine the percentage of correctly
recognised words. For many applications however this is not sufficient. We also need to
know the number of confusions and rejections separately. For an isolated word recognizer
for instance the following performance measures can be determined:

- Words inside vocab.     percentage correct
                          percentage rejected
                          percentage incorrect
- Words outside vocab.    percentage rejected (which is correct)
                          percentage incorrect (all positive responses)
- Confusions between words inside and outside vocabulary
- Predictive measures (to be discussed later).

For connected word recognizers an additional measure can be added:

- percentage insertions
- percentage deletions.
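
The isolated-word measures listed above can be tallied as in the following sketch. The
trial format (spoken word, response, with `None` standing for a rejection), the function
name and the example vocabulary are assumptions made for the illustration.

```python
def recognizer_measures(trials, vocabulary):
    """trials: list of (spoken_word, response); response is None for a rejection."""
    counts = {
        "inside": {"correct": 0, "rejected": 0, "incorrect": 0},
        "outside": {"rejected": 0, "incorrect": 0},
    }
    for spoken, response in trials:
        if spoken in vocabulary:
            if response is None:
                counts["inside"]["rejected"] += 1
            elif response == spoken:
                counts["inside"]["correct"] += 1
            else:
                counts["inside"]["incorrect"] += 1
        elif response is None:
            counts["outside"]["rejected"] += 1    # rejecting is the correct outcome
        else:
            counts["outside"]["incorrect"] += 1   # any positive response is an error
    percentages = {}
    for group, tally in counts.items():
        total = sum(tally.values())
        percentages[group] = {k: (100.0 * v / total if total else 0.0)
                              for k, v in tally.items()}
    return percentages

# Hypothetical trials
vocabulary = {"one", "two", "three"}
trials = [("one", "one"), ("two", "two"), ("three", None), ("two", "one"),
          ("nine", None), ("nine", "one")]
measures = recognizer_measures(trials, vocabulary)
print(measures)
```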

There are several ways to determine these percentages. A frequently used method is
given by Hunt [8].
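
One common way to count insertions and deletions is a minimum-edit alignment between the
spoken and the recognised word string. The sketch below uses a plain Levenshtein
alignment with unit costs; it is not claimed to be the method of the cited reference, and
the function name is an assumption.

```python
def align_counts(reference, recognised):
    """Return (substitutions, deletions, insertions) of a minimum-edit alignment."""
    n, m = len(reference), len(recognised)
    # cost[i][j] = (edits, subs, dels, ins) for reference[:i] against recognised[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, i, 0)             # everything deleted
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, 0, j)             # everything inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, d, k = cost[i - 1][j - 1]
            if reference[i - 1] == recognised[j - 1]:
                best = (e, s, d, k)                    # match, no cost
            else:
                best = (e + 1, s + 1, d, k)            # substitution
            e, s, d, k = cost[i - 1][j]
            best = min(best, (e + 1, s, d + 1, k))     # deletion
            e, s, d, k = cost[i][j - 1]
            best = min(best, (e + 1, s, d, k + 1))     # insertion
            cost[i][j] = best
    return cost[n][m][1:]

spoken = "1 2 3 4".split()
subs, dels, ins = align_counts(spoken, "1 3 4".split())
print(100.0 * dels / len(spoken), 100.0 * ins / len(spoken))
```

The percentages are taken relative to the number of spoken words; other normalisations
are possible and should be stated with the results.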

Significance of the performance of different recognizers can be tested by means of
statistical tests such as the analysis of variance (ANOVA) or the McNemar test [3]. As
for some vocabularies a very low error rate is obtained, the application of a statistical
test requires a very high number of trials to get significant results. In our opinion a
more difficult vocabulary would be more adequate. This is similar to the relation as
obtained between CVC-words and short sentences for intelligibility testing.
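
The McNemar test only looks at the tokens on which two recognizers disagree. A minimal
exact (binomial) version is sketched below; the function name and the counts in the
example are assumptions for illustration.

```python
from math import comb

def mcnemar_exact(only_a_correct, only_b_correct):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_a_correct + only_b_correct
    if n == 0:
        return 1.0                            # the recognizers never disagree
    k = min(only_a_correct, only_b_correct)
    # Probability of a split at least this uneven under a fair-coin hypothesis
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

print(mcnemar_exact(3, 8))
print(mcnemar_exact(30, 80))
```

With a low error rate only a handful of tokens are discordant. In the first call the
split of 3 against 8 gives a p-value of roughly 0.23, so the same proportion only becomes
significant when many more trials are collected, as in the second call.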

The performance measures as given above are very dependent on the vocabulary, number
of speakers, training etc. A more general measure, independent of the vocabulary, is to
determine how human listeners recognize the same vocabulary with the same recognition
score but for the condition that the test words are masked by noise. The level of the
noise required for an identical score as the recognizer is called the human equivalent
noise level [14]. Such a noise level opens the possibility to compare results for
different vocabularies according to the intelligibility measures as given before in
chapter 3.1.
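
Given a measured human score as a function of the signal-to-noise ratio, the equivalent
condition for a recognizer score can be read off by interpolation. The sketch below
assumes a piecewise-linear curve; the data points and the function name are invented for
the example.

```python
def human_equivalent_snr(human_curve, recogniser_score):
    """human_curve: list of (snr_db, percent_correct) points for human listeners."""
    points = sorted(human_curve)
    for (snr0, s0), (snr1, s1) in zip(points, points[1:]):
        lo, hi = min(s0, s1), max(s0, s1)
        if lo <= recogniser_score <= hi and s0 != s1:
            frac = (recogniser_score - s0) / (s1 - s0)
            return snr0 + frac * (snr1 - snr0)    # linear interpolation
    raise ValueError("recogniser score lies outside the measured human range")

# Hypothetical human scores on the recognizer's vocabulary
curve = [(-12.0, 20.0), (-6.0, 50.0), (0.0, 80.0), (6.0, 95.0)]
snr = human_equivalent_snr(curve, 65.0)
print(snr)
```

The function returns a signal-to-noise ratio; for a fixed speech level the equivalent
noise level is that speech level minus the returned value.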

International standardisation of assessment methods is a necessity for getting
comparable results. Some years ago the already mentioned NATO research study group
RSG-10 established a data-base for isolated and connected digits and for native and
non-native talkers. This data-base has been used for many experiments at different
locations. An example of an evaluation using this data-base is given in chapter 4.3.
Some other bodies working on test standardisation are ESPRIT-SAM and the Nat. Inst. of
Standards and Technology (NIST, formerly NBS). Both bodies have established speech
data-bases on a CD-ROM. RSG-10 is recording its already existing noise data-base on
CD-ROM [25]. Also the standardisation of speech level measures, in order to specify
signal-to-noise ratios in a reproducible manner, is under consideration [23].

A method where the recogniser performance is specified as a function of the
variation of specific speech parameters and environmental conditions was proposed
recently [26]. The method uses a small test vocabulary with minimal difference word-sets
of CVC-type words. Training and scoring are according to an open response experimental
design. This results in valuable diagnostic properties. By means of an analysis-
resynthesis technique, the test words can be physically manipulated according to changes
of human speech in well defined conditions. For this purpose a cook-book is under
development which describes the relevant parameters and amount of variation for
conditions like inter and intra speaker variability, male/female, normal/stressed etc.


4.2 APPLICATION EXAMPLES

The recognition rate depends very much on the acceptance criterion between the fit
of the best matching template and the speech token. The higher the acceptance criterion
the lower the number of correct responses, the higher the number of rejections but also
the lower the number of false responses. In Fig. 5 an example is given for an isolated
word recogniser trained with [...] words. The figure shows the recognition rate as a
function of the acceptance threshold (solid line); the figure also gives the rate of
false response of words outside the vocabulary (this is here the rate of the second
choice responses). An optimal adjustment for this recogniser with this vocabulary is
around a threshold setting of approx. [...] where the best separation between a high
correct response score and a low false response score is achieved.
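
The trade-off can be made concrete with a small threshold sweep. Everything in the
sketch is assumed for illustration: the match scores, the convention that a token is
accepted when its score reaches the threshold, and the use of the largest difference
between correct and false rate as the "best separation". The second-choice approximation
used for the figure is not modelled.

```python
def sweep_thresholds(in_vocab, out_vocab, thresholds):
    """in_vocab: (score, is_correct) of the best match for in-vocabulary tokens.
    out_vocab: best-match scores for out-of-vocabulary tokens.
    Returns rows of (threshold, correct_rate, false_rate).
    """
    rows = []
    for t in thresholds:
        correct = sum(1 for s, ok in in_vocab if ok and s >= t) / len(in_vocab)
        false = sum(1 for s in out_vocab if s >= t) / len(out_vocab)
        rows.append((t, correct, false))
    return rows

def best_separation(rows):
    """Threshold with the largest gap between correct and false response rate."""
    return max(rows, key=lambda r: r[1] - r[2])[0]

# Hypothetical match scores
in_vocab = [(0.9, True), (0.8, True), (0.7, True), (0.6, True), (0.5, False), (0.4, True)]
out_vocab = [0.3, 0.45, 0.55, 0.65]
rows = sweep_thresholds(in_vocab, out_vocab, [0.2, 0.4, 0.6, 0.8])
print(best_separation(rows))
```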

Speech recognition for connected words and for talkers using a language other than
their native language is a problem that arises in a multinational community like NATO.
Therefore RSG-10 conducted an experiment with non-native talkers and with connected and
isolated digits [15]. In Fig. 6 the error rate for five recognisers and humans is given
for groups of 1, 3, 4, and 5 connected digits respectively. It is obvious that the more
digits there are in a group the more likely it is that there will be a recognition
error. All systems show a good performance for isolated digits. For connected digits
some recognisers show a poor performance.
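
Part of the growth of the error rate with group length is pure arithmetic. If digit errors
were independent, which is an assumption made here and which connected speech does not
strictly satisfy, the probability of at least one error in a group would grow as follows.

```python
def group_error_rate(digit_error_rate, group_size):
    """Probability of at least one error in a group, assuming independent digit errors."""
    return 1.0 - (1.0 - digit_error_rate) ** group_size

# Hypothetical per-digit error rate of 2%
for size in (1, 3, 4, 5):
    print(size, round(100.0 * group_error_rate(0.02, size), 2))
```

A recognizer whose group error rate rises faster than this baseline is losing
performance to the connected-speech conditions themselves, for instance coarticulation
and word boundary detection.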


Fig. 5 Recognition rate as a function of the acceptance threshold for
correct responses (solid line) and false responses (dotted line).


It was found in this study that a slightly different result is obtained for male and
female voices; however some recognisers perform better with female voices while others
do better with male voices.

The effect of language and native/non-native talkers speaking English digits is very
significant. The individual speaker variation however explains more variance than any
other parameter. Similar results were found by Steeneken, Tomlinson and Gauvain [17].


Fig. 6 Effect of the group of connected digits on the recognition error
rate for five connected speech recognizers and humans [15].


5 FINAL REMARKS AND CONCLUSION

Some examples of evaluation methods of speech systems have been given. Several items
for future development were identified, such as evaluation at sentence level for speech
output systems and objective evaluation methods for speech input systems.

The availability of standardised evaluation methods and data-bases increases the
feasibility to compare results from different studies.

An aspect not discussed in this review, but relevant for the application of speech
input/output systems in combination with computer systems, is the dialogue structure and
the man-machine interfacing.


6 REFERENCES


ABSTRACT


This paper deals with speech processing standards for 64, 32, 16 kb/s and
lower rate speech and, more generally, speech-band signals which are or will
be promulgated by CCITT and NATO. The International Telegraph and Telephone
Consultative Committee (CCITT) is the international body which deals, among
other things, with speech processing within the context of ISDN. Within NATO
there are also bodies promulgating standards which make interoperability
possible without complex and expensive interfaces.

The  paper  highlights  also  some  of  the  applications  for  low-bit  rate 
voice  and  the  related  work  undertaken  by  CCITT  Study  Groups  which  are 
responsible  for  developing  standards  in  terms  of  encoding  algorithms,  codec 
design  objectives  as  well  as  standards  on  the  assessment  of  speech  quality. 

1.  STANDARDS  ORGANISATIONS 

The dictionary meaning of "Standards" as applied to Telecommunications is
"Something established by authority, custom or general consent as a model or
example". Standards are known by different names depending on the source, for
example, standards, specifications, regulations, recommendations. By any
name, their purpose is to achieve the necessary or desired degree of
uniformity in design or operation to permit systems to function beneficially
for both providers and users. The intended scope of standards can vary. They
may be internal within a company or they may apply to an entire country, a
world region, or the world as a whole.

This  paper  deals  only  with  international  (global  and  regional  CEPT  and 
NATO)  standards  as  far  as  speech  processing  systems  are  concerned. 
International  Standards  organisations  are  of  two  types,  treaty  based  and 
voluntary. 

The treaty based world organisation is the International Telecommunication
Union (ITU), which is governed by the International Telecommunication
Convention, a multilateral treaty. The International Consultative Committee
for Telegraph and Telephone (CCITT) and the International Consultative
Committee for Radio (CCIR) are the two technical organs of the ITU involved
in standards making.

The  standardization  activities  of  the  Conference  of  European  Posts  and 
Telecommunications  Administrations  (CEPT)  supplement  the  action  of  the  CCITT. 
Generally  a  preliminary  agreement  among  European  countries  enables  common 
proposals  to  be  introduced  which  make  the  work  of  CCITT  study  groups  easier 
and  quicker.  In  other  cases,  where  international  Recommendations  offer 
several  choices,  the  CEPT  encourages  its  members  to  adopt  the  same  solution. 
In  addition  new  systems  jointly  studied  by  several  European  countries  also 
form  the  subject  of  Recommendations  at  the  international  level.  Finally,  the 
CEPT  is  to  define  a  common  system  for  the  approval  procedures  applying  to 
terminal  equipment.  All  these  activities  result  in  the  drafting  of 
Recommendations by the CEPT. However, in terms of its activities, the CEPT
never competes with or opposes the action of the CCITT or any other
international organisation.

Military standards, covering both procedures and materials required by the
member nations to enable their forces to operate together in the most
effective manner, are evolved in NATO by various committees and groups and are
promulgated by the "Military Agency for Standardisation" (MAS) in the form of
NATO Standardisation Agreements (STANAGs).

The  so-called  "Voluntary  or  Industry  Standards"  are  documents  prepared  by 
nationally  recognized  industrial  and  trade  associations  and  professional 
societies  for  use  by  the  general  public.  Most  of  these  "standards"  usually 
feed  into  the  work  of  the  international  standards  organisations. 

2. WORKING METHODS OF THE CCITT


The primary objectives of the CCITT are to standardise, to the extent
necessary, techniques and operations in telecommunications to achieve
end-to-end compatibility of international telecommunication connections,
regardless of the countries of origin and destination. CCITT Standards are
useful also for national applications and in most countries today national
and even local equipment comply with CCITT Standards. In developing
Standards, the CCITT is required by its rules to invite other organisations
to give specialist advice on subjects that are of mutual interest. Thus, very
close cooperation is ensured and account taken of the work done by other
organisations such as ISO and IEC.

The main principles of the working procedures of the CCITT are set out in
the International Telecommunication Convention whereas the detailed
procedures are contained in various resolutions of the CCITT Plenary
Assemblies [1].

The work program of the CCITT in the various domains such as transmission,
switching etc. is established at every Plenary Assembly in the form of
Questions submitted by the various Study Groups based on requests made to the
Study Groups by their members. The Plenary Assembly assesses the various
Study Questions, reviews the scope of the Study Groups, and allocates
Questions to them. The Study Groups organise their work (that is, which
Questions are to be dealt with by the Plenary of the Study Group, by a
Working Party, a Special Rapporteurs' group, or an ad hoc group) and appoint
the chairmen, Special Rapporteurs, etc.

Work on CCITT Study Questions normally leads to one or several draft
Recommendations to be submitted for approval to the next Plenary Assembly.
All Recommendations, new or amended, are printed in the various volumes of
the CCITT Book after approval.

The present CCITT Study Groups together with their areas of interest are
given in Table I below.


Table I

CCITT Study Groups and Their Areas of Responsibility

I       Definition and operational aspects of telegraph and telematic
        services (such as facsimile, telex, and videotex)

II      Telephone operation and quality of service

III     General tariff principles

IV      Transmission maintenance of international lines, circuits,
        and chains of circuits; maintenance of automatic networks

V       Protection against dangers and disturbances of
        electromagnetic origin

VI      Protection and specifications of cable sheaths and poles

VII     Data communication networks

VIII    Terminal equipment for telematic services

IX      Telegraph networks and terminal equipment

XI      Telephone switching and signalling

XII     Telephone transmission performance and local telephone
        networks

XV      Transmission systems

XVI     Telephone circuits

XVII    Data communications over the telephone network

XVIII   Digital networks

In the CCITT, several Study Groups are involved in speech processing
standardisation activities. Study Group XVIII's Working Party on Speech
Processing is responsible for setting up the standards in terms of encoding
algorithms and related codec design objectives. Standards on the assessment
of speech quality fall under the responsibility of Study Group XII. Standards
on network objectives, to which the codec design and performance should
comply, are the subject of study for standardisation by Study Group XVIII
(digital networks) and Study Group XV (mixed analog-digital networks).

The Working Party on Speech Processing of Study Group XVIII has been
acting for several Study Periods (a four-year time period) as the
coordinating body that plans the various steps of the standardisation process
addressed to both new technologies for network transport (e.g., adaptive
differential pulse code modulation (ADPCM) at 32 kbit/s) and new
technologies in support of new service capabilities (e.g., coding of
wideband speech, i.e., 7 kHz, within 64 kbit/s).

The procedure to obtain a consensus on the standard processing algorithms
consists of a technical selection among competing candidate codecs. Selection
is done on the basis of a series of subjective tests (to assess speech
quality) and objective laboratory tests (to assess voiceband data quality) on
prototype codecs. Tests are carried out in different world locations
according to standard CCITT measurement conditions and procedures. They aim
to verify codec performance under realistic network environmental
conditions. Typical test conditions include:

- single encoding,

- single encoding with injected digital errors with random or bursty
arrival statistics,

- synchronous and asynchronous tandem encoding for up to eight links in
tandem (synchronous refers to digital-to-digital tandem encoding
between 64 kbit/s PCM and another digital coding format; asynchronous
tandem encodings involve analog signal representation between successive
encodings),

- asynchronous tandem encodings with injected analog impairments (noise,
loss, amplitude and delay distortion, phase jitter, harmonic
distortion), these conditions being critical for voiceband data
performance.

Voice quality is assessed by subjective listening tests with absolute
judgment scores. The test uses a five-point scale and is based on mean
opinion score (MOS) judgements under defined test conditions. These
conditions include: source speech, reference conditions (white noise,
speech-correlated noise, handset characteristics), digital error generation,
and test administration.
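
The MOS calculation itself is simple and can be sketched as follows. This is
only an illustration: the category labels, the function name
mean_opinion_score and the listener votes are assumptions made for the
example and are not taken from the CCITT test procedures.

```python
from statistics import mean

# Five-point absolute judgment scale. The labels are the conventional ones
# and are assumed here; the text only states that the scale has five points.
SCALE = {5: "excellent", 4: "good", 3: "fair", 2: "poor", 1: "bad"}


def mean_opinion_score(votes):
    """Return the MOS for one test condition from listener votes (1..5)."""
    if not votes:
        raise ValueError("at least one vote is required")
    if any(v not in SCALE for v in votes):
        raise ValueError("votes must be integers from 1 to 5")
    return mean(votes)


# Hypothetical votes from eight listeners for a single test condition.
example_votes = [4, 4, 3, 5, 4, 3, 4, 5]
example_mos = mean_opinion_score(example_votes)
```

In a real test one such score would be computed per condition (single
encoding, tandem encodings, injected errors and so on) and compared with the
scores of the reference conditions.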

Voiceband data quality is assessed in the same network environment as for
voice on a variety of CCITT-specified modems and facsimile equipment. Quality
is measured on the basis of bit or block error performance.

Selection activities are conducted by a group of experts consisting of
representatives from administrations, operating agencies, and manufacturers,
who must establish a multilaboratory test workplan, evaluate the obtained
performance, and finally agree on a specific codec algorithm by taking into
account other aspects such as

- codec complexity,

- codec delay,

- ease of transcoding with PCM,

- amenability to variable rate coding.

It is to be noted that standards result from the selection among
competing systems, and they also incorporate some of the best features of
their competitors. Another point that should be mentioned about standards is
that they sometimes turn out not to be satisfactory in the field, in which
case they are reconsidered and modified.

3.  CCITT  SPEECH  PROCESSING  STANDARDS 

Before we discuss recent and future CCITT standardisation activities we
should mention that the first and the most significant milestone in speech
processing standards was achieved at the end of the 1960s (amended in 1972)
with the promulgation of CCITT Recommendation G.711 concerning "Pulse
Code Modulation (PCM) of voice frequencies" (2). This recommendation
specified (together with Rec. G.702) 64 kb/s PCM coding using a sampling rate
of 8000 samples per second and two encoding laws commonly referred to as the
A-law and the μ-law. PCM, which dominates speech processing applications in
today's networks, has a high degree of robustness to transmission errors and
tandem encodings and offers satisfactory performance to speech and voiceband
data in most mixed applications.
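
As an illustration of what an encoding law does, the sketch below uses the
continuous μ-law characteristic with μ = 255. This is an assumption made for
clarity: G.711 itself defines a segmented, table-based approximation that is
not reproduced here, and the function names are hypothetical. The 8 bits per
sample follow from dividing 64 kb/s by 8000 samples per second.

```python
import math

MU = 255.0             # parameter of the continuous mu-law characteristic
SAMPLE_RATE_HZ = 8000  # sampling rate quoted in the text
BITS_PER_SAMPLE = 8    # 64 kb/s divided by 8000 samples per second


def mu_law_compress(x, mu=MU):
    """Map a linear sample x in [-1, 1] to a companded value in [-1, 1]."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


bit_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE  # bits per second
```

The compression expands small amplitudes before uniform quantisation (an
input of 0.01 maps to roughly 0.23), which is what gives logarithmic PCM a
nearly constant signal-to-quantising-noise ratio over a wide range of speech
levels.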

Advances made in previous years in digital signal processing eventually
led, in the study period 1982-84, to the initiation of standardisation
activities in CCITT in speech processing (3) and to the setting up, under
Study Group XVIII, of a Working Party to deal with the establishment of such
standards.


The standardisation activities of CCITT in speech processing may be
divided into two groups:

a) those related to the so-called "low-bit-rate voice" (LBRV), which aim at
overcoming, in the short-to-medium term, before the widespread use of
the emerging optical fibre, the economic weakness of 64 kb/s PCM in
satellite and long-haul terrestrial links and copper subscriber loops;

and b) those associated with "high-fidelity voice" (HFV) with bandwidth up
to 7 kHz for applications such as loudspeaker telephones,
teleconferencing and commentary channels for broadcasting.

The  standards  that  have  been  issued  and  the  ones  on  which  work  is  still 
in  progress  are  outlined  below  for  the  cases  (a)  and  (b)  above. 

3.1. 32 kb/s Adaptive Differential Pulse Code Modulation (ADPCM)

The latest version of CCITT Recommendation G.721 (4,5) (first approved in
1984 and revised in 1986) specifies standards for the conversion of a 64 kb/s
A-law or μ-law PCM channel to and from a 32 kb/s channel. In the ADPCM
encoder, the A-law or μ-law PCM input signal is first converted into uniform
PCM and then a difference signal is obtained by subtracting an estimate of
the input signal from the input signal itself. An adaptive 15-level quantiser
is used to assign four binary digits to the value of the difference signal
for transmission to the decoder. An inverse quantiser produces a quantised
difference signal from the same four digits. The signal estimate is added to
this quantised difference signal to produce the reconstructed version of the
input signal. Both the reconstructed signal and the quantised difference
signal are operated upon by an adaptive predictor which produces the
estimate of the input signal, thereby completing the feedback loop.
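
The feedback loop described above can be sketched with a toy coder. This is
not the G.721 algorithm: the fixed first-order predictor (leak), the uniform
quantiser and the step adaptation factors are all assumptions chosen to keep
the example short. Only the structure (difference signal, 15-level
quantiser, inverse quantiser, reconstruction, prediction) follows the text.

```python
import math

LEVELS = 15         # quantiser levels, as in the text (fits in four bits)
HALF = LEVELS // 2  # codes run from -7 to +7


def _adapt(step, code):
    """Grow the step when the outer levels are used, shrink it otherwise."""
    factor = 1.5 if abs(code) >= HALF - 1 else 0.95
    return min(1.0, max(1e-4, step * factor))


def encode(samples, step=0.02, leak=0.9):
    """Return (codes, locally reconstructed signal) for uniform PCM input."""
    estimate, codes, recon = 0.0, [], []
    for x in samples:
        diff = x - estimate                       # difference signal
        code = max(-HALF, min(HALF, round(diff / step)))
        y = estimate + code * step                # inverse quantiser + adder
        codes.append(code)
        recon.append(y)
        step = _adapt(step, code)                 # adaptive quantiser
        estimate = leak * y                       # toy predictor
    return codes, recon


def decode(codes, step=0.02, leak=0.9):
    """Mirror of the encoder feedback path, driven only by the codes."""
    estimate, out = 0.0, []
    for code in codes:
        y = estimate + code * step
        out.append(y)
        step = _adapt(step, code)
        estimate = leak * y
    return out


# 200 Hz tone sampled at 8 kHz, used only to exercise the loop.
tone = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
codes, local = encode(tone)
remote = decode(codes)
```

Because the decoder repeats exactly the feedback portion of the encoder, both
ends adapt from the transmitted codes alone and stay in step without any side
information, which is the property the Recommendation relies on.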

The ADPCM decoder includes a structure identical to the feedback portion
of the encoder, together with a uniform PCM to A-law or μ-law conversion and
a synchronous coding adjustment. The synchronous coding adjustment prevents
cumulative distortion occurring on synchronous tandem codings
(ADPCM-PCM-ADPCM etc. digital connections) under certain conditions. The
synchronous coding adjustment is achieved by adjusting the PCM output codes
in a manner which attempts to eliminate quantising distortion in the next
ADPCM encoding stage.

The  perceived  quality  of  speech  over  32  kb/s  ADPCM  links  is  comparable  to 
64  kb/s  PCM  for  up  to  two  asynchronous  codings,  slightly  poorer  for  four 
codings  and  significantly  worse  with  eight  codings.  It  is  clear  that  the 
deployment  of  asynchronous  tandem  codings  of  ADPCM  in  the  network  must  be 
limited.  CCITT  have  adopted  a  voice  criterion  which  allows  a  maximum  of  four 
asynchronous  ADPCM  codings  on  an  end-to-end  connection  if  there  is  no  other 
source  of  quantizing  distortion.  In  addition,  CCITT  Recommendation  G.113 
allows  one  ADPCM  coding  in  the  national  network  on  the  national  extension  of 
an  international  connection.  On  the  other  hand,  ADPCM  is  more  robust  than  PCM 
in  the  presence  of  random  bit  errors. 

Block error rate (BLER, with 1000 bits in a block) test results, carried out
with random additive noise (and some other added analog impairments such as
delay distortion, non-linear distortion and phase jitter), for 2400 b/s V.26
and 4800 b/s V.27 modems show that, with an acceptability criterion of a
10^-2 BLER at an S/N of 24 dB, ADPCM provides an acceptable level of
performance with both modems and with four asynchronous codings. As expected,
the degradation with ADPCM relative to PCM is more pronounced with the higher
speed V.27 signals. Performance of 9600 b/s V.29 is not acceptable for even
one ADPCM coding.

For  applications  where  bit  error  rate  (BER)  is  recognised  as  an  important 
criterion,  the  acceptability  limit  is  BER  <10”-’  .  This  is  almost  always  a 
more  stringent  constraint  than  BLER.  Using  this  criterion,  some  modems 
provide  acceptable  performance  with  only  two  or  three  asynchronous  codings  at 
the  4800  b/s  rate.  In  general,  the  impact  on  voiceband  data  performance  is 
considerable  even  when  limiting  criteria  are  met. 
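
The relation between the two error measures can be sketched under the
simplifying assumption that bit errors are independent, which is not
generally true of modem errors over ADPCM links. The function names are
hypothetical; the 1000-bit block length is the one quoted above.

```python
def block_error_rate(ber, block_bits=1000):
    """Probability that a block contains at least one bit error."""
    return 1.0 - (1.0 - ber) ** block_bits


def ber_for_bler(bler, block_bits=1000):
    """Bit error rate that yields the given block error rate."""
    return 1.0 - (1.0 - bler) ** (1.0 / block_bits)


# Bit error rate equivalent to the 10^-2 BLER criterion for 1000-bit blocks.
equivalent_ber = ber_for_bler(1e-2)
```

With independent errors a BLER of 10^-2 on 1000-bit blocks corresponds to a
BER of about 10^-5. When errors arrive in bursts, several bit errors fall in
the same block, so the same BLER is reached at a higher BER. This is one
reason why a BER limit can be the more demanding of the two criteria.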

Classical  transmission  measurements  such  as  S/N  must  be  interpreted  with 
care  for  adaptive  signal  processing  algorithms  such  as  ADPCM,  since  S/N 
typically  depends  on  input  signal  statistics.  In  other  words,  such 
measurements,  in  general,  cannot  be  used  to  predict  performance  for  other 
signals  with  significantly  different  spectral  and  temporal  characteristics. 

3.2.  7  kHz  Audio-Coding  Within  64  kb/s 

CCITT Recommendation G.722 (6,7) describes the characteristics of an
audio (50 to 7000 Hz) coding system which may be used for a variety of higher
quality speech applications. The coding system uses sub-band adaptive
differential pulse code modulation (SB-ADPCM) within a bit rate of 64 kb/s.
In the technique used, the frequency band is split into two sub-bands (higher
and lower) and the signals in each sub-band are encoded using ADPCM. The
system has three basic modes of operation corresponding to the bit rates used
for 7 kHz audio coding: 64, 56 and 48 kb/s. These are the subjects of Draft
Recommendations G.72y and Y.221 (Frame structure for a 64 kb/s Channel in
Audio-Visual Teleservices), which also cover other speech bit rates, or data
rates up to a full 64 kb/s data path.

The  64  kb/s  (7  kHz)  audio  encoder  comprises  a  transmit  audio  part  which 
converts  the  audio  signal  to  a  uniform  digital  signal  which  is  coded  using  14 
bits  with  16  kHz  sampling  and  a  SB-ADPCM  encoder  which  reduces  the  bit  rate 
to  64  kb/s. 


The  corresponding  decoder  comprises  a)  a  SB-ADPCM  decoder  which  performs 
the  reverse  operation  to  the  encoder  noting  that  the  effective  audio  coding 
bit  rate  at  the  input  of  the  decoder  can  be  64,  56  or  48  kb/s  depending  on 
the  mode  of  operation;  and  b)  a  receive  audio  part  which  reconstructs  the 
audio  signal  from  the  uniform  digital  signal  which  is  encoded  using  14  bits 
with  16  kHz  sampling. 

For  applications  requiring  an  auxiliary  data  channel  within  the  64  kb/s 
the  following  two  parts  are  needed; 

-  a  data  insertion  device  at  the  transmit  end  which  makes  use  of,  when 
needed,  1  or  2  audio  bits  per  octet  depending  on  the  mode  of  operation  and 
substitutes  data  bits  to  provide  an  auxiliary  data  channel  of  8  or  16  kb/s 
respectively; 

-  a  data  extraction  device  at  the  receive  end  which  determines  the  mode 
of  operation  according  to  a  mode  control  strategy  and  extracts  the  data  bits 
as  appropriate. 
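
The insertion and extraction devices can be illustrated with a small sketch.
It assumes that the substituted bits are the least significant bits of each
octet, which the text does not state, and all function names are
hypothetical. The rate of 8000 octets per second follows from 64 kb/s divided
by 8 bits.

```python
def insert_data(octets, data_bits, bits_per_octet):
    """Overwrite the lowest 1 or 2 bits of each audio octet with data bits."""
    if bits_per_octet not in (1, 2):
        raise ValueError("only 1 or 2 bits per octet are available")
    if len(data_bits) != len(octets) * bits_per_octet:
        raise ValueError("data length must match the available capacity")
    mask = (1 << bits_per_octet) - 1
    out = []
    for i, octet in enumerate(octets):
        value = 0
        for bit in data_bits[i * bits_per_octet:(i + 1) * bits_per_octet]:
            value = (value << 1) | bit
        out.append((octet & ~mask & 0xFF) | value)
    return out


def extract_data(octets, bits_per_octet):
    """Recover the data bits written by insert_data."""
    bits = []
    for octet in octets:
        for shift in range(bits_per_octet - 1, -1, -1):
            bits.append((octet >> shift) & 1)
    return bits


def aux_rate_bps(bits_per_octet, octets_per_second=8000):
    """Auxiliary channel rate obtained from the substituted bits."""
    return bits_per_octet * octets_per_second
```

Substituting one bit per octet leaves 56 kb/s for audio and yields an 8 kb/s
data channel; substituting two leaves 48 kb/s and yields 16 kb/s, matching
the three modes of operation.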

The  wideband  speech  algorithm  outlined  above  was  selected  by  assuming 
end-to-end  digital  connectivity  and  excluding  the  requirement  of  voiceband 
data  transmission  or  asynchronous  tandem  encodings  (synchronous  transcoding 
to  and  from  uniform  PCM  to  provide  conference  bridge  arrangements  is 
required) . 


To  allow  switching  among  64,  56,  and  48  kb/s  speech  coding  rates,  the 
lower  subband  (0-4000  Hz)  ADPCM  coder  is  designed  to  operate  at  6,  5  or  4 
bit/sample.  Embedded  coding  is  used  to  prevent  quality  degradation  in  case 
of  a  mismatched  mode  of  operation  between  the  encoder  and  decoder. 

The subjective evaluation tests conducted with the standard algorithm in
terms of average MOS versus the three encoding bit rates at different BER
show (7) that when BER is better than 10^-4 MOS stays around the value of 4,
increasing slightly with bit rate, whereas at BER = 10^-3 the MOS remains
almost constant with bit rate at the value of 3. With four synchronous
transcodings, MOS changes from about 3 to 4 with bit rate for BER ≤ 10^-4.

3.3.  Draft  CCITT  Recommendation  G.72z 


This recommendation, which will probably be numbered G.723, extends
Rec. G.721 to include the conversion of a 64 kb/s A-law or μ-law PCM channel
to and from a 24 kb/s or 40 kb/s channel (8). The principal application of 24
kb/s channels is for overload channels carrying voice signals in Digital
Circuit Multiplication Equipment (DCME). 40 kb/s channels are used mainly for
carrying data modem signals in DCME, especially for modems operating at
greater than 4800 b/s (32 kb/s channels do not perform well with 9.6 kb/s
modems).


DCME  makes  use  of  digital  speech  interpolation  (DSI)  and  low-bit-rate 
voice  techniques  to  increase,  with  respect  to  64  kb/s  PCM,  the  number  of 
simultaneous  voice  calls  transmitted  over  a  digital  link  (9).  DSI  takes 
advantage  of  limited  voice  activity  during  a  call  (less  than  40%  of  the  time) 
and  transmits  only  the  active  parts  of  a  conversation  (talkspurts) .  The 
channel  capacity  is  allocated  to  talkspurts  from  other  conversations  during 
silent  intervals.  The  use  of  variable  bit  rate  coding  of  talkspurts  avoids 
effectively  the  annoying  "freeze-out"  effect  which,  if  allowed  to  occur, 
would  result  in  the  loss  of  a  talkspurt  as  a  consequence  of  excessive  traffic 
load  on  the  digital  link. 

G.72z recommends that when using 32 kb/s ADPCM, coding should be
alternated rapidly to 24 kb/s such that at least 3.5 to 3.7 bits/sample are
used on average (for further study). The effect on speech quality of this
alternation is not expected to be significant. The use of 24 kb/s coding for
data transmission is not recommended.
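
The averaging behind this recommendation is simple arithmetic and can be
sketched as follows. The function names are hypothetical; the 4 and 3 bits
per sample follow from dividing 32 kb/s and 24 kb/s by the 8000 samples per
second of the PCM channel.

```python
SAMPLE_RATE_HZ = 8000
BITS_AT_32K = 32000 // SAMPLE_RATE_HZ  # 4 bits per sample
BITS_AT_24K = 24000 // SAMPLE_RATE_HZ  # 3 bits per sample


def average_bits(fraction_at_32k):
    """Average bits per sample when alternating between the two rates."""
    return (BITS_AT_32K * fraction_at_32k
            + BITS_AT_24K * (1.0 - fraction_at_32k))


def fraction_needed(target_bits):
    """Fraction of samples that must be coded at 32 kb/s for a target."""
    return (target_bits - BITS_AT_24K) / (BITS_AT_32K - BITS_AT_24K)
```

An average of 3.5 to 3.7 bits per sample therefore means that between half
and about 70% of the samples are still coded at 32 kb/s, giving an average
channel rate of 28 to 29.6 kb/s.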


Tests conducted indicate that for voice the 40 kb/s ADPCM coding performs
approximately as well as 64 kb/s PCM according to Rec. G.711. Voiceband data
at speeds up to 12000 bit/s can be accommodated by 40 kb/s ADPCM. The
performance of V.33 modems operating at 14400 bit/s over 40 kb/s ADPCM is
for further study.




Under normal DCME operating conditions, no significant problems with DTMF
signalling or with Group 2 and 3 facsimile apparatus are expected (9).

There are three modes of DCM operation so far identified:

- Point-to-Point Mode

- Multi-Clique Mode (based on a limited multidestinational capability,
with perhaps fixed but relatively small bearer capacities)

- Full Multi-Point Mode (based on fully-variable capacity allocation of
multi-destinational bearer channels).

A review of the activities of various CCITT study groups and of national
bodies shows that current plans provide the means within DCME to accommodate
the bearer services defined in Rec. I.211 (Red Book), sections 2.1.1 64-kb/s
unrestricted, 2.1.2 64-kb/s usable for speech, 2.1.3 64-kb/s usable for 3.1
kHz audio, and 2.1.4 alternate speech/64-kb/s non-speech.

There are several issues concerned with DCM implementation which are
being addressed as Question 31/XVIII by CCITT Working Party XVIII/8.

3.4  Other  CCITT  Activities  for  Future  Standards 

CCITT and other organisations (CEPT, Intelsat, Inmarsat, etc.) have
established a number of network applications which require speech bit rates
less than 32 kb/s. As has been pointed out elsewhere in the lecture series,
16 kb/s is the lowest bit rate today giving high quality of speech, although
coders operating at lower speeds exist which give adequate quality for
applications in a circuit-oriented network environment or in packet networks.

The following main applications for 16 kb/s speech coding have been
identified by CCITT Working Party XVIII/8 under Question 27/XVIII:

i) Land Digital Mobile Radio (DMR) systems and portable telephones;

ii) Low C/N digital satellite systems. This includes maritime thin-route
and single channel per carrier satellite systems;

iii) DCME. In this equipment low bit rate encoding is generally combined
with DSI. The equipment may be used for long terrestrial connections
and for digital satellite links generally characterised by high C/N
ratios;

iv) PSTN: This application covers the encoding of voice telephone
channels in trunk, junction or distribution networks;

v) ISDN: This application is similar to that foreseen in PSTN, it being
understood that in this case end-to-end digital connectivity at 64
or 128 kb/s is available for multimedia applications such as video
telephones (e.g., 16 kb/s voice and 48 kb/s video);

vi) Digital leased lines. Two possibilities may be envisaged in this
case: one is where the end-to-end digital leased circuits include
only one encoding/decoding; the other is where the end-to-end
digital leased circuits are connected into the public network and
may include digital transcodings;

vii) Store and Forward systems;

viii) Voice messages for recorded announcements.

It has been agreed that CCITT should play the role of overall
coordinator of activities related to the above applications in order to
assist the various organisations in their studies in areas of common
interest. This would allow the achievement of consistency between the
performance requirements of the specific application and that of the overall
network. This coordinating role is especially required in the definition of
sensitive networking topics such as speech quality objectives, capability in
terms of cascade transcoding and processing delay. The issue of CCITT
guidelines on the aforementioned topics would help to ensure international
network interconnections with satisfactory overall performance.

The organisations involved in the early identification of the speech
coding algorithms for specific applications have been invited to provide
CCITT with timely information on networking issues and in particular to
indicate their speech quality objectives in terms currently used in CCITT
(e.g. qdu and/or MOS).

The network performance parameters collected to date for the various
applications are summarised in Table II.

It is important to note that there are different priorities attached to
the different applications and that if the urgent requirements are not
tackled in a timely manner then there would be the possibility of increasing
proliferation of 16 kb/s speech coding standards tied to specific
applications. Urgent action is required for applications i) and ii) in the
table.

The European Telecommunications Administrations are in the process of
planning a common digital mobile radio system which will be launched in
llM/lit?. CEPT Working Group GSM (Groupe Special Mobile), which was set up in
1982 to coordinate studies and activities (covering aspects of speech
quality, transmission delay, and complexity), has recently selected (10) a
speech coding algorithm which is of the linear predictive coding type at 13
kb/s rate using Regular Pulse Excitation and Long-Term Prediction: LPC
(RPE-LTP).

INMARSAT is planning to introduce a new maritime satellite communication
system from 1990 onwards which will provide users with high quality
communication links even under adverse propagation conditions. INMARSAT is
proposing a 16 kb/s Adaptive Predictive Coding (APC) algorithm which will
meet the requirements shown in Table II.

In addition to the various 16 kb/s codec applications which require
urgent CCITT action, several opinions have been expressed that CCITT should
also undertake early activities on speech coding at around 8 kb/s in order to
anticipate the likely development of autonomous standards in the near future,
such as the use of 8-9.6 kb/s for speech coding in DMR in order to effect
better spectrum utilization.

The CCITT has just set up an Expert Group to establish whether it is
possible to select a unique coding algorithm approach that meets the
requirements of the various network applications. Activities in this
direction will likely develop in the next two years with the aim of
minimizing the number of alternative coding techniques to be chosen as CCITT
standards in the next Study Period (1988-1992).

CCITT has also initiated studies for the Study Period 1988-92 regarding
"Speech Packetization", "Encoding for stored digitized voice", and "Speech
analysis/synthesis techniques".

Packetized speech may find applications both for short-term
implementations, such as DCME (11), and for longer-term applications, i.e.,
in the evolving broad-band ISDN when the "asynchronous transfer mode" (ATM)
of operation will be implemented (8,12). DCM applications are related to the
use of digital links at speeds on the order of a few Mbit/s, while ISDN-ATM
applications are foreseen at much higher link speeds (i.e., 90-150 Mbit/s).

Among the problems to be studied the following items may be mentioned:

- Interfaces (1536/1984 kb/s)

- Speech coding algorithms (PCM, ADPCM)

- Voice-band data

- Error detection

- Voice delay

- Performance (packet loss and bit dropping).

The CCITT work (Question 29/XVIII) on "Encoding for stored digitized
voice" assumes that the transmission of voice messages among store-and-forward
systems is in line with the message handling system (MHS) procedures
specified in CCITT Rec. X.400 to X.420. It is also accepted that algorithms
developed under "16 kb/s speech coding" could be used even for the encoding
of the stored voice, especially when associated with a suitable silence
encoding.

The general requirements for the encoding of stored voice are tentatively
given by CCITT as follows:

- low bit rate, possibly using silence coding;

- high quality speech (equivalent to 6 to 7 bit PCM);

- speaker recognizability;

- variable rate operation, i.e. graceful degradation of voice quality
when the bit rate is decreased;

- robustness in multi-speaker conditions and with typical background
office noise.

Standards for voice storage services are likely to cover bit rates from 4
to 16 kb/s. The bandwidth is likely to be about 1 kHz, but it is too early
for CCITT to settle on a coding technique. It is to be noted that coding
delay will be much less of a problem here than, say, in packetized speech
with real-time conversations.

As far as Question 12/XVIII on "speech analysis/synthesis techniques" is
concerned, there has not been much activity within CCITT Working Party
XVIII/8 even though many member countries have been very active in this field
with encouraging results. The only contribution reaching CCITT seems to have
come from INMARSAT, who reported on activities undertaken outside CCITT to
proceed towards 4.8/9.6 kb/s encoding standards by the AEEC (Airlines
Electronic Engineering Committee) for telephony applications from commercial
airplanes.
4. NATO STANDARDISATION ACTIVITIES IN SPEECH PROCESSING

Like the National Security Agency in the USA and similar agencies in the
other countries, NATO also has not waited for international agreements and
has set standards for voice coding at rates ranging from 2.4 kb/s to 16 kb/s.

4.1. NATO STANAG 4198


The NATO Standardisation Agreement (STANAG) 4198, which was promulgated on
13 February 1984 by the NATO Military Agency for Standardisation (MAS),
defines the voice digitizer characteristics, the coding tables and the bit
format requirements to ensure the compatibility of digital voice produced
using 2400 b/s Linear Predictive Coding (LPC).


The content of this agreement is outlined below, as an indication of what
needs to be specified in order to assure interoperability between equipments
manufactured by different nations.


a) Description of Linear Predictive Coding


Figs 1 and 2 give the block diagrams of the transmitter and receiver
portions of a typical LPC system.

i) The input bandwidth must be as wide as possible, consistent with a
sampling rate of 8 kHz. It is desirable that the pass band be flat
within 3 dB from 100 to 3600 Hz.

ii) After first order pre-emphasis (1 - 0.9375 z^-1), 10 predictor
coefficients are determined by linear predictive analysis.

iii) For pitch and voicing analysis, 60 pitch values are calculated over
the frequency range of 50 to 400 Hz. A two-state voicing decision is
made twice per 22.5 millisecond frame.

iv) The excitation and spectrum parameters are then coded and error
corrected for transmission at 2400 b/s.
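
The first-order pre-emphasis of step ii) can be sketched as a simple
difference equation, y[n] = x[n] - 0.9375 x[n-1]. The function name and the
zero initial condition are assumptions made for the illustration; the
coefficient, sampling rate and frame length are those quoted above.

```python
SAMPLE_RATE_HZ = 8000
# 22.5 ms frame at 8 kHz, computed in integers to avoid rounding.
FRAME_SAMPLES = SAMPLE_RATE_HZ * 225 // 10000


def pre_emphasis(samples, coeff=0.9375):
    """Apply y[n] = x[n] - coeff * x[n-1], with x[-1] taken as zero."""
    out, previous = [], 0.0
    for x in samples:
        out.append(x - coeff * previous)
        previous = x
    return out


# A constant input is almost cancelled, showing the high-pass behaviour.
flat = pre_emphasis([1.0, 1.0, 1.0])
```

Each frame contains 180 samples. The filter attenuates the low-frequency
content that dominates voiced speech, so the linear predictive analysis that
follows works on a spectrally flatter signal.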


b)  Votes  Digitiser  Characteristics, 


sampling  Hate 
Predictor  Order 
Transmission  Data  Rate 
Frame  Length 


Pitch 
Voicing 
Amp) Itude 


flg.iBEKm,Anily»l# 

Pre-emphasis 

Spectrum  Approximation 
Jpeotrum  Coding 


8  kHs  *  .it 
10 

2400  b/s  t  ,011 

22 . 5  me  (84  bits  per  frame) 


50-400  Hi i  aeml- logarithmic  coding 
(to  values) 

A  two-ateta  voicing  decision  la  made  twice 
a  frame 

Speech  root-mean-equare  (rme)  value,  eemt- 
logarithmic  coding  (32  values) 


Typical  first  order  digital  trenofer 
function  1  -  .»J75irr 
10th  order  ell-pole  filter 
Log  area  ratio  for  the  first  two 
coefficients  and  linear  reflection 
coefficient*  for  the  remainder 


Trtnsinimlon  Data  format 

Synchronisation 
Pi toh/ Voicing 
Amplitude 

Reflection  Coefficients 


1  bit 
7  bits 
5  hits 


41  bite  tor  10  coefficients  if  vetoed,  or 
to  bite  for  4  coefficient*  with  20  error 
protection  bice  if  unvoioed 


Error Detection and Correction

Voicing Decision           (1) Full-frame unvoiced decision encoded
                               as a 7-bit word having a Hamming weight
                               of zero (7 zeros)
                           (2) Half-frame voicing transition encoded
                               as a 7-bit word having a Hamming weight
                               of seven (7 ones)

Unvoiced Frame Parameters  Hamming (8,4) codes to protect most
                           significant bits of amplitude information
                           and first 4 reflection coefficients.

Voiced Frame Parameters    (1) 60 pitch values mapped into 60 of 70
                               possible 7-bit words having a Hamming
                               weight of 3 or 4
                           (2) Typically, for good performance under
                               error conditions, an adaptive smoothing
                               algorithm should be applied to pitch,
                               amplitude and first 4 reflection
                               coefficients for eradication of gross
                               errors based on the respective
                               parameter values over three consecutive
                               frames
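
The Hamming-weight scheme for the 7-bit pitch/voicing word can be illustrated as follows. This is a sketch under stated assumptions: the actual assignment of pitch values to code words is given in the STANAG tables and is not reproduced, and the three-point median is only a simple stand-in for the adaptive smoothing algorithm, which is not specified here. All function names are hypothetical.

```python
def hamming_weight(word):
    """Number of ones in a 7-bit word."""
    return bin(word & 0x7F).count("1")


# All 7-bit words of weight 3 or 4: the pool from which 60 pitch codes are drawn.
PITCH_WORD_POOL = [w for w in range(128) if hamming_weight(w) in (3, 4)]


def classify_pitch_voicing(word):
    """Classify a received 7-bit word by its Hamming weight."""
    weight = hamming_weight(word)
    if weight == 0:
        return "unvoiced"
    if weight == 7:
        return "transition"
    if weight in (3, 4):
        return "voiced"
    return "error"  # any other weight cannot be a valid code word


def median3(previous, current, following):
    """Stand-in smoother: median of a parameter over three consecutive frames."""
    return sorted((previous, current, following))[1]
```

Because the valid words are separated by their weights, a single bit error in an all-zero or all-one word is detectable, and an isolated gross pitch error is removed by the median over three frames.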


Synthesis 


The  synthesis  filter  must  be  a  10th  order  all-pole  filter  with 
appropriate excitation signals for voiced and unvoiced sounds capable
of  satisfying  the  speech  intelligibility  requirements  as  specified  in 
Section  (d)  below. 

The typical de-emphasis transfer function is 1 / (1 - 0.75 z^-1).

A  recommended  40  sample  all-pass  excitation  for  voiced  speech  is  as 
specified  in  the  STANAG. 
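
A minimal sketch of the synthesis chain may help: an all-pole filter driven by an excitation signal, followed by the de-emphasis filter given above. The excitation used in the example is a plain unit impulse chosen for illustration; the recommended 40-sample all-pass excitation is defined in the STANAG and is not reproduced. Function names are hypothetical.

```python
def all_pole_filter(excitation, a):
    """y[n] = e[n] + sum_j a[j] * y[n-1-j], starting from zero state."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for j, aj in enumerate(a):
            if n - 1 - j >= 0:
                acc += aj * y[n - 1 - j]
        y.append(acc)
    return y


def de_emphasize(x, mu=0.75):
    """De-emphasis 1 / (1 - mu z^-1): y[n] = x[n] + mu * y[n-1]."""
    y = []
    previous = 0.0
    for value in x:
        previous = value + mu * previous
        y.append(previous)
    return y


# Impulse response of the de-emphasis filter alone.
impulse_response = de_emphasize([1.0, 0.0, 0.0, 0.0])
```

In a decoder the list `a` would hold the 10 predictor coefficients recovered from the received reflection coefficients, and the output of `all_pole_filter` would be passed through `de_emphasize`.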


c)  Interoperable Coding and Decoding

The RMS, reflection coefficients, pitch and voicing are coded to 2400
b/s. The frame length is 22.5 ms. The bit allocation for the voiced
and non-voiced frames, the specified transmitted bit stream for voiced
and non-voiced frames, synchronization pattern, coding of the
reflection coefficients and the logarithmic coding of RMS have to be
specified as in the tables given in the STANAG.

d)  Performance  Characteristics 

i)  The performance of LPC-10 speech processors shall be measured in
terms of an intelligibility test and a free conversation test (14).

ii) The voice intelligibility of the voice processor shall be measured
using the Diagnostic Rhyme Test (DRT-IV). For the DRT, English,
American and French versions are to be used and the talkers and
listeners are to be familiar with the language in each case. The
input analogue tapes to be used for the English and American DRT and
the minimum acceptable scores, which should be obtained from an
independent contractor, are given below:

Acoustic                                     Bit Error               Minimum
Environment         Talkers  Tapes           Rate       Microphone   Acceptable Score

Quiet               6M       E-1-A, E-1-B    0          Dynamic      86
Office              3M       C-4-A           0          Dynamic      84
Shipboard (Saipan)  3M       K6-1.2-A        0          H250         85
Aircraft (P-3C)     3M       K7-1.2-A        0          EV985        82
Jeep                3M       K8-1.2-A        0          H250         82
Tank                3M       K9-1.2-A        0          EV985        82
Quiet               3M       E-1-A           2.0%       Dynamic      82
E3A                 3M       K1-11A          0          215-330      82
F15                 3M       K-10-1          0          M101         75
F16                 3M       K-11-1          0          M101         75
TORNADO F-2

iii)  The  Free  Conversation  Test  shall  be  carried  out  using  at  least  6 
pairs  of  subjects  who  shall  have  no  undue  difficulty  in  conversing 
over  a  normal  telephone  circuit.  A  mean  opinion  score  of  at  least 
2.5  shall  be  obtained  when  the  speech  is  transmitted  between 
typical  office  environments  and  with  zero  bit  error  rate. 

e)  STANAGs Related to STANAG 4198


There are NATO STANAGs which specify the modulation and coding
characteristics that must be common to assure interoperability of 2400
b/s linear predictive encoded digital speech transmitted over HF radio
facilities (STANAG 4197, promulgated on 2 April 1984), and on 4-wire
and 2-wire circuits (STANAG 4291, promulgated on 21 August 1986).

4.2.  4800 b/s Voice Coding Standard

There is a US Proposed Federal Standard (PFS-1016), to be considered also
by the US Military and NATO, for a 4800 b/s voice coder which is claimed to
outperform all US government standard coders operating at rates below 16 kb/s
and even to have comparable performance to 32 kb/s CVSD and to be robust in
acoustic noise, channel errors, and tandem coding conditions (15).

PFS-1016 is embedded in the US proposed Land Mobile Radio standard
(PFS-1024) that includes signalling and forward error correction to form an
8 kb/s system. A real-time implementation of a 6400 b/s system with an
embedded PFS-1016 is said to be submitted for consideration in INMARSAT's
standard system.

The coder, jointly developed by the US DOD and AT&T Bell Laboratories,
uses code excited linear predictive (CELP) coding, which is a frame-oriented
technique that breaks a sampled input signal into blocks of samples (i.e.,
vectors) which are processed as one unit. CELP is based on
analysis-by-synthesis search procedures, two-stage perceptually weighted
vector quantization (VQ) and linear prediction. A 10th order linear
prediction filter is used to model the speech signal's short-term spectrum
and is commonly referred to as a spectrum predictor. Long-term signal
periodicity is modeled by an adaptive code book VQ (also called pitch VQ
because it often follows the speaker's pitch in voiced speech). The residual
from the spectrum prediction and pitch VQ is vector quantized using a fixed
stochastic code book. The optimal scaled excitation vectors from the adaptive
and stochastic code books are selected by minimizing a time varying,
perceptually weighted distortion measure. The perceptual weighting function
improves subjective speech quality by exploiting masking properties of human
hearing.
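
The analysis-by-synthesis search can be illustrated with a toy example. This is a simplified sketch, not the PFS-1016 procedure: it searches a single code book, omits the adaptive (pitch) code book and the perceptual weighting, and uses a plain squared error. The code book, filter coefficient and function names are hypothetical.

```python
def synthesize(excitation, a):
    """Zero-state all-pole synthesis: y[n] = e[n] + sum_j a[j] * y[n-1-j]."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for j, aj in enumerate(a):
            if n - 1 - j >= 0:
                acc += aj * y[n - 1 - j]
        y.append(acc)
    return y


def search_codebook(target, codebook, a):
    """Return (index, gain, squared_error) of the best scaled code vector."""
    best = None
    for index, code in enumerate(codebook):
        y = synthesize(code, a)          # synthesize each candidate
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy  # optimal scale
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


# Hypothetical toy data: a first order synthesis filter and three code vectors.
a = [0.5]
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, -1.0, 1.0, -1.0]]
target = [2.0 * v for v in synthesize(codebook[1], a)]
index, gain, error = search_codebook(target, codebook, a)
```

Only the index and the quantised gain need to be transmitted; the decoder holds the same code book and regenerates the excitation.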

4.3.  NATO  Draft  STANAG  4380 

The draft STANAG 4380 has been prepared by the "Subgroup on Tactical
Communications Equipment" of the "Tri-service Group on Communications and
Electronic Equipment" (TSGEE) and has been forwarded to the Major NATO
Commanders for review/comment and to the Nations for ratification. The STANAG
deals with Technical Standards for Analogue-Digital Conversion of Voice
Signals using 16 kb/s delta modulation and syllabic companding controlled by
3-bit logic (CVSD). A block diagram of the coder/decoder is shown in Fig 3.
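
The principle of delta modulation with syllabic companding controlled by 3-bit logic can be sketched as follows. The step sizes, filter constants and class name are hypothetical values chosen for illustration and are not taken from the draft STANAG; the step is simply made linear in a smoothed run-of-three indicator.

```python
import math


class CvsdTracker:
    """State shared by coder and decoder: 3-bit logic, syllabic filter, integrator."""

    def __init__(self, step_min=0.01, step_max=0.2, syllabic_decay=0.99, leak=0.98):
        self.step_min = step_min
        self.step_max = step_max
        self.syllabic_decay = syllabic_decay
        self.leak = leak
        self.history = []
        self.duty = 0.0
        self.estimate = 0.0

    def update(self, bit):
        # 3-bit logic: flag a run of three consecutive bits of the same polarity.
        self.history = (self.history + [bit])[-3:]
        run_of_three = len(self.history) == 3 and len(set(self.history)) == 1
        # Syllabic filter: smoothed proportion of run-of-three flags.
        self.duty = (self.syllabic_decay * self.duty
                     + (1.0 - self.syllabic_decay) * (1.0 if run_of_three else 0.0))
        step = self.step_min + (self.step_max - self.step_min) * self.duty
        # Leaky principal integrator.
        self.estimate = self.leak * self.estimate + (step if bit else -step)
        return self.estimate


def cvsd_encode(samples, **params):
    tracker = CvsdTracker(**params)
    bits = []
    for sample in samples:
        bit = 1 if sample >= tracker.estimate else 0
        bits.append(bit)
        tracker.update(bit)
    return bits


def cvsd_decode(bits, **params):
    tracker = CvsdTracker(**params)
    return [tracker.update(bit) for bit in bits]


# 800 Hz tone sampled at 16 kHz, one bit per sample (16 kb/s).
tone = [0.5 * math.sin(2 * math.pi * 800 * n / 16000) for n in range(320)]
bits = cvsd_encode(tone)
decoded = cvsd_decode(bits)
```

Because the decoder derives its step size from the received bit stream alone, no side information is needed, which is what makes interoperability depend on both ends using the same companding characteristics.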

The following information is given as an indication of the standards that
are  required  to  ensure  interoperability,  where  and  when  required,  of  16  kb/s 
digital  voice  signals  for  tactical  communications. 

a)  Frequency  Response 

The input and output filters shall have a passband of at least
300 Hz to 2.7 kHz.

b)  Modulation Level

When an 800 Hz sinewave signal at 0 dBm0 is applied to the input of
the coder (point A in Fig 3), the duty cycle at the output of the
modulation level analyser (point C) shall be 0.5. (The duty cycle is
the mean proportion of binary digits at point C, each one indicating a
run of three consecutive bits of the same polarity at point B.)

c)  Companding

In both the coder and the decoder the maximum quantising step, which
drives the principal integrator at point D, shall have an essentially
linear relationship to the duty cycle; the ratio of maximum to minimum
quantising steps at the decoder output (point F) shall be 34 dB ± 2 dB.

With the coder output (point B) connected to the decoder input
(point B'), when an 800 Hz sinewave at the coder input (point A) is
changed suddenly from -42 dBm0 to 0 dBm0, the decoder output signal
(point F) shall reach its final value within 2 to 4 ms.

d)  Distortion and Noise

When a sinewave signal at -20 dBm0 is applied to the coder input (A),
the attenuation distortion at the decoder output (F), relative to that
at 800 Hz, shall be within the limits that are specified in the
STANAG; the distortion contributed by the coder alone, measured at the
output of the principal integrator (E), is also specified.

The idle channel noise at the output of the decoder (F) shall not
exceed -45 dBm0 and the level of any single frequency in the range 300
Hz to 8 kHz shall not exceed -50 dBm0.

The limits for the signal/noise ratio at the output of the decoder (F)
are also given in the draft STANAG 4380.
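
Two of the quantities used in these requirements lend themselves to a short worked example: the duty cycle, as defined under the modulation level requirement, and the step-size ratio expressed in dB. The sketch assumes the step ratio is an amplitude ratio (20 log10); the function names are hypothetical.

```python
def run_of_three_flags(bits):
    """1 where the current bit and the two before it have the same polarity."""
    return [1 if i >= 2 and bits[i] == bits[i - 1] == bits[i - 2] else 0
            for i in range(len(bits))]


def duty_cycle(bits):
    """Mean proportion of run-of-three flags over the bit stream."""
    flags = run_of_three_flags(bits)
    return sum(flags) / len(flags)


def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to an amplitude ratio."""
    return 10 ** (db / 20)


example_duty = duty_cycle([1, 1, 1, 1, 0, 0, 0, 0])
step_ratio = db_to_amplitude_ratio(34)
```

Under the amplitude-ratio assumption, a 34 dB companding range corresponds to a largest step roughly fifty times the smallest.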

5.  CONCLUSIONS 

In  the  past,  standards  used  to  follow  technology.  Manufacturers 
dominated standards activities. Contentions were avoided by adoption of
multiple  options  in  the  standards.  Interoperability  of  products  was  not  an 
issue.  The  marketplace  was  not  sensitive  to  the  speed  of  standards  processes. 

Today,  users  have  a  large  presence  in  standards  development.  They  demand 
interoperability  of  products,  and  they  reject  nonresolution  of  contentions. 
Standards processes are protracted and, in sharp contrast with the past,
standards  now  lead  technology. 

Standards affect customers, service providers, equipment manufacturers,
and vendors, and the growing importance of standards is now becoming
universally recognized. In the future, this importance will be driven even
higher by the increasing complexity of the technology and the rising
expectations of the world's growing population. As a result of this
recognition, a number of major trends are developing within the
telecommunication industry. These are:

*  Consumers  and  users  in  the  industry  are  increasingly  demanding 
compliance  with  standards. 

*  Standards  activities  are  growing  at  a  rapid  rate. 

*  There is a growing sense of urgency in standards.

These major trends can only be satisfied through increasing the
responsiveness of standards bodies. Standards bodies must be sensitive and
responsive to the growing needs for both timely delivery of accepted
standards and conservation of global resources in doing standards work.

Today, standards-making organizations are mainly bottom-up driven.
Direction is charted by the individual contributions submitted by members.
Studies are started by proposals from members, they are supported by member
contributions and they are completed when contributions cease and a consensus
is reached between the members. This approach is acceptable and necessary
but should be complemented by some top-down drive for timeliness and
responsiveness to whatever the requirements are.
Fig. 1  Linear predictive coder transmitter (typical)


Fig. 2  Linear predictive coder synthesizer (typical)

Fig. 3  Block diagram of coder and decoder
SUMMARY

Increases in the functional capabilities of military systems have made these
systems increasingly more difficult to operate. Increased operator workload in
modern workstations and aircraft has produced operator stress and fatigue, resulting
in degraded operator performance, especially in time critical tasks. One reason for
this problem is that both data entry and system control functions are often controlled
via the system's keyboard. In some systems functions are nested many layers deep,
making the system inefficient and difficult to use. For this reason RADC has been
developing technology to improve the interface between the Air Force system and its
operator. Many efforts and several technologies are being pursued in speech
recognition and synthesis, multimodal interface techniques, and voice interactive
concepts and methods. This work is being conducted to satisfy the Air Force
requirements for modern communication stations and the FORECAST II Battle Management
and Super Cockpit Programs.


1. INTRODUCTION

Interest in potential uses of Automatic Speech Recognition (ASR) technology is
steadily increasing in both military and civilian communities. Much of this interest
is due to advances in electronics and computers rather than in new techniques for
speech recognition. Despite its current limitations, ASR promises to aid in a variety
of military applications by increasing the effectiveness and efficiency of the
man-machine interface. Indeed, military organisations have long been, and continue to
be, one of the main sources of support of research and development of ASR technology.

This paper examines some recent applications of ASR technology. It is not
intended to be exhaustive but rather presents a representative perspective of the
military uses of this technology. Four major categories of applications will be
discussed: Audio Signal Analysis, Voice Input for Command and Control, Message
Sorting by Voice, and Speech Understanding/Natural Language Processing for the DOD
Gister Program.


2. AUDIO SIGNAL ANALYSIS

RADC has been developing speech enhancement technology to improve the quality,
readability and intelligibility of speech signals that are masked and interfered with
by communication channel noise. RADC's interest in speech enhancement is not only in
improving the quality, readability and intelligibility of speech signals for human
listening and understanding but also in improving speech signals for machine
processing. Speech technology such as speaker identification, language recognition,
narrowband communications, and word recognition being developed by RADC requires good
quality signals in order to provide effective results. The development of automatic
real-time speech enhancement technology is therefore of high interest to RADC.

There are a large number of Air Force applications for speech enhancement that are
being addressed by RADC. Many AF systems that perform silence or gap removal and/or
speech compression have difficulty with the processing of noisy communications data.
In many instances gap removal is completely ineffective and compression schemes
completely degrade speaker identity and cause large reductions in intelligibility.
These systems require speech enhancement to be operationally effective. RADC is also
addressing the use of Automatic Speech Recognition (ASR) in noisy environments such as
in the Forecast II Super Cockpit Program. Although there has been some success in
using restricted and well structured ASR in the cockpit, difficulties with acoustic
noise in the airborne environment are much more troublesome for larger vocabulary
continuous speech recognition systems. The successful use of enhancement for ASR can
offer performance improvements that will make voice control and data entry
operationally acceptable for many airborne applications.

Another area in which the noise generated in an aircraft causes problems is the
use of vocoders for narrowband jam resistant communications. Vocoder technology use
is restricted in many airborne applications because the acoustic noise generated by
the aircraft degrades the intelligibility of the vocoder system to an unacceptable
level. Speech enhancement to reduce the aircraft noise offers the capability to make
a variety of vocoder technology available for airborne use.
2.1. THE SPEECH ENHANCEMENT UNIT

RADC has developed a Speech Enhancement Unit (SEU) which provides an on-line,
real-time capability to remove frequently encountered communication channel
interferences with minimum degradation to the speech signals. The types of
interferences or noises removed can be classed into three groups: (1) impulse noise
such as static and ignition noise, (2) narrowband noise which includes all tone-like
noise, and (3) wideband random noise such as atmospheric, receiver electronic noises,
and aircraft noise. The impulse noise removal process is a time domain process. The
process is very effective for removing impulses up to 20 milliseconds in length. The
narrowband noise removal process is a frequency domain process and is unique in that
it can remove both high level and low level tones; the tones, of which there may be
several hundred, may be fixed or moving. This capability has been extremely useful in
removing power converter hums and heterodyne signals found on communication channels.
The wideband noise removal process used is a subtractive process that is accomplished
in the spectrum of the square root of the amplitude spectrum. While this function is
not the same as the cepstrum (the cepstrum is the spectrum of the log amplitude
spectrum), since it resembles the cepstrum it is referred to as the root-cepstrum. In
this method of noise reduction the average root-cepstrum of the noise in the input
signal is continually updated and subtracted from the root-cepstrum of the combined
speech and noise. Because the random noise concentrates disproportionately more power
in the low region of the root-cepstrum than does the speech, the subtracted
reconstructed time signal produces an enhanced speech signal. A picture of the
prototype, the VLSI, and the VHSIC enhancement units is shown in Figure 1.
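
The root-cepstrum subtraction can be illustrated with a small sketch. It is not the SEU implementation, whose details are not given here: it assumes a plain (slow) discrete Fourier transform on short frames, reuses the phase of the noisy input when rebuilding the time signal, clamps negative root-amplitudes to zero, and updates the noise estimate with a simple exponential average. All names and constants are hypothetical.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def root_cepstrum(frame):
    """Spectrum of the square root of the amplitude spectrum (plus the spectrum)."""
    spectrum = dft(frame)
    root_amplitude = [math.sqrt(abs(s)) for s in spectrum]
    return dft(root_amplitude), spectrum


def update_noise_average(average, new_root_cepstrum, weight=0.1):
    """Running (exponential) average of the noise root-cepstrum."""
    return [(1 - weight) * a + weight * n for a, n in zip(average, new_root_cepstrum)]


def enhance_frame(frame, noise_root_cepstrum):
    """Subtract the noise root-cepstrum and rebuild the time signal."""
    rc, spectrum = root_cepstrum(frame)
    cleaned = [c - n for c, n in zip(rc, noise_root_cepstrum)]
    root_amplitude = [max(v.real, 0.0) for v in idft(cleaned)]
    amplitude = [r * r for r in root_amplitude]
    rebuilt = [a * (s / abs(s)) if abs(s) > 0 else 0j
               for a, s in zip(amplitude, spectrum)]
    return [v.real for v in idft(rebuilt)]


# Hypothetical stand-in for a measured noise-only frame.
N = 32
noise = [0.1 * math.sin(1.7 * n) for n in range(N)]
noise_rc, _ = root_cepstrum(noise)
residual = enhance_frame(noise, noise_rc)        # noise minus its own estimate
unchanged = enhance_frame(noise, [0j] * N)       # nothing subtracted
```

With a zero noise estimate the frame is returned unchanged, and a frame that matches the noise estimate exactly is cancelled, which are the two limiting cases of the subtractive process.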

2.2. TESTING THE SEU

The SEU has been tested in two areas. They are (1) the reduction of communication
channel noise to improve the recognition performance of human listeners and (2) the
reduction of wideband random noise and aircraft cockpit noise to improve the
performance of automatic speech recognition (ASR) systems. Improvements in performance
have been demonstrated in other areas.

The first test conducted on the SEU determined the effect of processing radio
frequency voice communication channels containing a variety of off-the-air noises on
the monitoring performance of humans. The signals were monitored by equally skilled
trained Air Force operators both before and after enhancement by the SEU. The data
were controlled so that no operator heard the same data before and after enhancement.
The readability of the signals was rated before and after enhancement, and the
ratings are shown in Figure 2. Note the shift in the readability of the signals after
enhancement. The results clearly show an improvement in readability. Moreover, not
only was there a significant improvement in the readability of the signals, but
operator fatigue was reduced, intelligibility improved and, very importantly, the
enhancement process was found to be capable of being operated in an entirely
automatic mode. Also important was the uncovering of events that were not
recognized before enhancement. These results appear to agree with equipment
laboratory tests which showed the narrowband and impulse noise to be attenuated by as
much as 40 dB. Measurements on the wideband removal process showed a signal-to-noise
ratio improvement of 15 to 21 dB.

The second set of tests was conducted to determine the effect of using the SEU as
a preprocessor to automatic speech recognition systems. Several speech recognizers
were used with good results.

The results of a test conducted at an Air Force flight laboratory with the SEU
acting as a preprocessor to an LPC-based recognizer showed substantial recognition
improvements. The tests were conducted in a facility where the acoustic environment
of the F-16 cockpit was simulated. The tests were conducted using the Advanced
Fighter Technology Integration (AFTI) 36-word vocabulary. Training was accomplished
without the SEU and in SSdBa sound pressure level (SPL). Six subjects were tested:
four military pilots and two Lear Siegler personnel. The two subjects used for the
enhancement tests were the lowest scoring military pilots in the tests. Enhancement
was used only during the 109 dBA and 115 dBA noise level tests. At the 109 dBA noise
level, recognition performance increased from 46% without SEU processing to 75% after
enhancement. Performance jumped from 30% to 79% after enhancement for the 115 dBA
noise level condition (Figure 3).

Other tests using the SEU as a preprocessor have shown varying degrees of
improvement. Test results without training the recognizer through the SEU show digit
recognition improving from 20% correct recognition to 83% after enhancement for an
input S/N of 3 dB using wideband random noise (Figure 4).

Better performance was obtained by training the recognizer through the SEU under
no noise conditions. The results obtained for this condition show an improvement from
21% to 100% correct recognition after enhancement at a 10 dB S/N.

3. VOICE INPUT FOR COMMAND AND CONTROL

There are many man-machine interface (MMI) problems associated with the modern
communication stations, battle management workstations and the advanced aircraft
cockpit such as the Super Cockpit. Several factors have led to the MMI problems and
the subsequent thrust by the Air Force to develop MMI technologies. They are:

o  Adding  on  new  capabilities  to  existing  systems 

o  New systems with many combined capabilities

o  Increased complexity of the environment

o  Reduced time to complete tasks

o  Increases in the number of time critical tasks

Many of these factors are the direct result of reduced manpower (accomplish more with
fewer operators), and the increased speed of events caused by higher speed aircraft
and advanced weaponry.

New ASR and speech synthesis technology forms the basis for voice input/output
(I/O) systems. Such systems can improve man-machine interaction in Air Force
requirements for modern communication stations and the Forecast II Battle Management
and Super Cockpit Programs. Speech communication with machines can offer advantages
over other modes of communication such as manual methods, especially when humans are
engaged in tasks requiring hands and eyes to be busy. Speech offers the most natural,
and potentially the most accurate and fastest, mode of communication, but is
susceptible to environmental interference, and restricted by speaker and training
requirements. Researchers are currently investigating speech recognition techniques
which would permit a more natural, continuous form of speaking style and which would
require a minimum amount of training by the speaker.

The workload of the military flight crew is becoming more demanding, due to
increases in the amount of complex equipment crew members must monitor and control.
Hence, there are constant demands on crew members for manual, visual and aural
attention in order to perform vital mission functions, such as navigation, controlling
weapons and monitoring sensors. At present, most critical functions are performed via
manual operation of switches and keys. The increase in the number of manual tasks, as
well as information processing demands, has made it difficult for the crew member to
perform all the necessary functions while maintaining control of the aircraft. ASR
technology can aid in relieving this information and motor overload by allowing the
use of voice to control manual functions.

An airborne environment presents serious problems for any speech recognition
device. These problems include high ambient noise, high g-forces, vibration, effects
of oxygen masks, and extremes of altitude, pressure, temperature and humidity. Yet
there is no doubt that military organisations see ASR technology as an integral part
of future airborne cockpit avionics if the challenge of operating in the harsh air
environment can be met.

Currently RADC has several efforts for the development of MMI concepts and a
testbed for test and evaluation of those concepts. The overall purpose of the
research is to determine the requirements to provide efficient interfaces for the
advanced Air Force cockpits and workstations. The voice interface goals are to
develop the rudiments of an overall philosophy for verbal interaction with these
systems.

In order to develop the philosophy and subsequent techniques, detailed scenarios
for the cockpit and workstations are being analyzed in terms of tasks, workload types,
type and amount of information to be transferred, time constraints, criticality of the
information, and environmental conditions. Using this scenario information,
experiments will be conducted to determine fundamental relationships such as:

o  the effects of S/N in terms of time and accuracy on the
completion of an audio task at various audio workloads.

o  the effects of various visual, manual, and oral workloads on
various audio (listening) tasks and vice versa.

o  the effects of injecting audio messages (both voice and sound)
into a system under various audio, visual, and manual workloads.

Knowing these inter-relationships will narrow the number of interface modalities
for a given task under a given set of conditions and allow an estimation of a
performance level. Based on the results from the experiments, a design for an MMI
testbed will be developed, and a testbed fabricated. Tests will be conducted using
the communications scenarios.

A different type of voice command system is used to control entry to secure areas
and computer systems. There is significant military interest in the use of automated
systems based on personal attributes (such as speech) to verify the identity of
individuals seeking access to restricted areas and systems (such as flight lines,
weapon storage areas, classified record storage areas, command posts, computers,
workstations, aircraft, etc.). In this application, ASR technology is
employed for automatic speaker verification, which identifies who is doing the talking
rather than the words being spoken. Techniques based on both amplitude spectral
information and Linear Predictive Coding (LPC) have proved successful.

In discussing the accuracy of speaker verification and other ASR systems, it is
important to note the tradeoffs that can be made which affect a system's performance.
The two most commonly recorded error types are rejection (a legitimate utterance is
falsely rejected) and substitution (an incorrect utterance is falsely substituted for
the legitimate utterance). In evaluating speaker verification performance, rejections
are called "Type I" errors and result when an authorised user has been incorrectly
denied access to a secure area. Substitutions are called "Type II" errors and are a
consequence of an impostor succeeding in gaining access as an authorized user. The
tradeoff between the two error types is illustrated in Figure 5.

Most ASR systems (including speaker verification) incorporate a variable threshold
which can be adjusted to control the balance between error types. Lowering the
threshold tightens the requirements for acceptance of an utterance and thus lowers the
Type II error, but with an increase in the Type I error. Also shown in Figure 5, by
the dotted curve, is a Receiver Operating Characteristic, which is a graph of the
overall recognition accuracy as a function of threshold.

Recognition accuracy may be increased by threshold adjustment, with, however, a
penalty of additional substitution errors (Type II errors).
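
The threshold tradeoff can be made concrete with a small sketch. It assumes the recognizer outputs a distance score (smaller means a closer match) and accepts an utterance when the distance does not exceed the threshold, so that lowering the threshold tightens acceptance. The scores below are invented purely for illustration and are not test results.

```python
def error_rates(genuine_distances, impostor_distances, threshold):
    """Return (Type I rate, Type II rate) for a given acceptance threshold."""
    # Type I: a legitimate speaker is rejected (distance above threshold).
    type_1 = sum(d > threshold for d in genuine_distances) / len(genuine_distances)
    # Type II: an impostor is accepted (distance at or below threshold).
    type_2 = sum(d <= threshold for d in impostor_distances) / len(impostor_distances)
    return type_1, type_2


# Hypothetical distance scores, for illustration only.
genuine = [0.2, 0.4, 0.5, 0.9]
impostor = [0.6, 0.8, 1.2, 1.5]

loose = error_rates(genuine, impostor, threshold=1.0)
tight = error_rates(genuine, impostor, threshold=0.45)
```

Sweeping the threshold over its range and recording both rates at each setting gives the pair of curves whose crossing point is often used to summarise a verification system.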

In one test of an automatic speaker verification system intended for military use,
the average Type I and Type II error rates were both on the order of one percent. The
test included over 100 talkers, over a several month test period (which included
occasions when speakers had colds or other voice ailments), and for an environment
with a high signal-to-noise ratio (SNR). This system was able to perform successfully
even when several professional mimics attempted to imitate selected target speakers.
Recent results obtained in a speaker verification test using 100 male and 100 female
speakers show a 1% Type I error for 7300 verification attempts and a 0.07% Type II
error for 28,000 verification attempts.

It is important when using ASR technology for military command and control
applications that the total system be considered, not just the voice component. A
thorough analysis of the human job tasks and a complete understanding of the system
and environment to which ASR technology is interfaced are necessary.

4. MESSAGE SORTING/AUDIO MANIPULATION

Listening to radio broadcasts is a time-consuming, manpower-intensive and tedious
task for military operators. This is due to the high density of received signals and
the poor signal quality, which causes operator fatigue and reduced effectiveness. A
potential solution to the problem is the use of ASR technology to automate part of the
listening process. There are several recognition technologies being pursued that
address the message sorting and routing problem; these include speaker identification,
language recognition and keyword recognition. In order to satisfy military
operational needs these recognition technologies must handle several operational
constraints.

An ASR system must:

o  Be context independent for speaker and language identification
o  Handle uncooperative speakers
o  Be robust to band-limited and noisy channels
o  Handle dynamic channel conditions
o  Operate on-line and in real-time
o  Perform recognition on very short messages

4.1  3PIAZSN  AOTHNNTI CATION 

Speaker authentication is one method of message sorting that can be used to reduce
the number of signals a communications operator must handle. Such systems must
identify unknown talkers on multiple channels in real time using a small sample of
their speech and under the above operational constraints. The operator can specify
those talkers who are of interest at a particular time, and the system will route to
the operator only speech that it identifies as spoken by the specified talkers.

Prior to executing a recognition task, a speaker authentication system is trained
using one to two minutes of speech from each of the talkers who may later be
recognized. The major specification for the system is that it shall correctly
identify speakers whose data have been processed using as little as two to five
seconds of their speech.

RADC has developed a Speaker Authentication System (SAS) that uses two techniques:
a multiple parameter algorithm using the Mahalanobis metric and an identification
technique based on a continuous speech recognition (CSR) algorithm. The multiple
parameter algorithm uses both speech and non-speech frames. The speech frames are
used to characterize the talker for recognition, and the non-speech frames to detect
possible changes in talkers.

Recognition is performed by comparing the current average parameter vector with
each of the active speaker models. Once per second the identities of the three models
that are closest to the speech being recognized are output with their corresponding
scores. Each second, the frames from the last second are accumulated and added to the
average. The distance is then computed using the Mahalanobis metric.
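
The once-per-second scoring step described above can be sketched as follows. This is a minimal illustration, not the SAS implementation: the parameter vectors, the model layout (a mean vector plus a precomputed inverse covariance matrix per speaker) and all names are assumptions made for the example.

```python
import math


def mahalanobis(x, mean, inv_cov):
    """Mahalanobis distance from vector x to a model given by its mean
    and inverse covariance matrix (list of rows)."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    # Quadratic form d^T * inv_cov * d
    q = sum(d[i] * inv_cov[i][j] * d[j]
            for i in range(len(d)) for j in range(len(d)))
    return math.sqrt(q)


class RunningAverage:
    """Average of all speech-frame parameter vectors accumulated so far."""

    def __init__(self, dim):
        self.total = [0.0] * dim
        self.count = 0

    def add_frames(self, frames):
        # Called once per second with the frames from the last second.
        for frame in frames:
            for k, value in enumerate(frame):
                self.total[k] += value
            self.count += 1

    def mean(self):
        return [t / self.count for t in self.total]


def closest_models(avg_vector, models, n=3):
    """Return the n (name, distance) pairs with the smallest distance."""
    scored = [(name, mahalanobis(avg_vector, m["mean"], m["inv_cov"]))
              for name, m in models.items()]
    return sorted(scored, key=lambda pair: pair[1])[:n]


# Hypothetical two-parameter models with identity inverse covariance.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
MODELS = {
    "A": {"mean": [0.0, 0.0], "inv_cov": IDENTITY},
    "B": {"mean": [1.0, 1.0], "inv_cov": IDENTITY},
    "C": {"mean": [5.0, 5.0], "inv_cov": IDENTITY},
    "D": {"mean": [10.0, 10.0], "inv_cov": IDENTITY},
}
```

In use, `add_frames` would be called each second with the new speech frames, followed by `closest_models(average.mean(), MODELS)` to obtain the three nearest models and their scores.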

The recognition module also monitors non-speech frames to detect pauses in the
input speech that are associated with possible changes in talkers. When non-speech
frames are input, the recognition module ignores the frame, but increments the
silence-frames-in-a-row counter. If the silence-frames-in-a-row counter exceeds a
silence threshold (user selectable, default value of 0.5 seconds), the recognition
module signals a possible change in talker.
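
The silence counter can be illustrated with a short sketch. The 10 ms frame length and the resetting of the counter on the next speech frame are assumptions made for the example; the text specifies only the 0.5 second default threshold.

```python
class TalkerChangeDetector:
    """Counts consecutive non-speech frames and flags a possible change
    in talker once the silence exceeds a threshold."""

    def __init__(self, frame_ms=10, silence_threshold_ms=500):
        self.frame_ms = frame_ms                      # assumed frame length
        self.silence_threshold_ms = silence_threshold_ms
        self.silence_frames_in_a_row = 0

    def process(self, is_speech):
        """Feed one frame; return True if a possible talker change is signalled."""
        if is_speech:
            # Assumption: any speech frame resets the run of silence.
            self.silence_frames_in_a_row = 0
            return False
        self.silence_frames_in_a_row += 1
        silence_ms = self.silence_frames_in_a_row * self.frame_ms
        return silence_ms > self.silence_threshold_ms
```

With these defaults, 50 consecutive non-speech frames equal the threshold exactly and do not trigger; the 51st does.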

A second approach uses small sub-word templates to model a person's voice
characteristics, rather than the long term spectral statistics that are used in the
multi-parameter technique. The test results, using a CSR speech recognition system,
show a very significant improvement in recognition accuracy over the first approach.
The recognition accuracy exceeded 95% for clean speech segments of 2 seconds or longer
duration, as compared to 75% for the multi-parameter technique. The CSR based
algorithm was also tested on noisy speech and again showed improvements in performance
over the results of the multi-parameter algorithm.

4.2 AUDIO MANAGEMENT

Increases in the functional capabilities of modern workstations have made them
increasingly difficult to manage and operate. Increased operator workload has
produced operator stress and fatigue, which has resulted in degraded operator
performance, especially in time critical tasks. RADC continues to investigate and
develop methods for audio handling, routing, and prioritization.

RADC developed an Advanced Speech Processing Station (ASPS) in the late 1970's.
The concept of the ASPS was to alleviate the problems associated with analog recording
methods by utilizing digital techniques. These techniques were the first to allow an
operator to play back pre-recorded speech while still recording incoming speech.
Utilizing a two minute buffer, digital techniques allowed the operator to manipulate
the audio signal in the following ways: jump backwards or forwards, speed up or slow
down while retaining frequency information, repeat or loop speech segments, tag
speech for instant recall and remove silence or non-speech gaps.
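
A fixed-length buffer that supports playback while recording continues can be sketched as a ring buffer. This is only an illustration of the idea, not the ASPS design: the class, its method names and the sample-based capacity are assumptions, and speed changes and silence removal are left out.

```python
class AudioRingBuffer:
    """Keeps the most recent `capacity` samples and lets a listener play,
    jump and tag positions while new audio is still being recorded."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = [0] * capacity
        self.written = 0      # total number of samples ever recorded
        self.play_pos = 0     # absolute index of the next sample to play
        self.tags = {}

    def oldest(self):
        """Absolute index of the oldest sample still held in the buffer."""
        return max(0, self.written - self.capacity)

    def record(self, samples):
        for s in samples:
            self.data[self.written % self.capacity] = s
            self.written += 1
        # If playback fell behind the buffer, skip to the oldest held sample.
        self.play_pos = max(self.play_pos, self.oldest())

    def play(self, n):
        end = min(self.play_pos + n, self.written)
        out = [self.data[i % self.capacity] for i in range(self.play_pos, end)]
        self.play_pos = end
        return out

    def jump(self, offset):
        """Move backwards (negative) or forwards, staying inside the buffer."""
        target = self.play_pos + offset
        self.play_pos = min(max(target, self.oldest()), self.written)

    def tag(self, name):
        self.tags[name] = self.play_pos

    def recall(self, name):
        """Return to a tagged position; False if it has been overwritten."""
        pos = self.tags[name]
        if pos < self.oldest():
            return False
        self.play_pos = pos
        return True
```

For a two minute buffer, `capacity` would be 120 times the sampling rate; looping a segment amounts to tagging its start and calling `recall` repeatedly.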

The new system, called the Mini-ASPS, improved both the audio and text
capabilities, provided better operator interfacing, and reduced workstation size,
weight and cost. Tests on the Mini-ASPS have demonstrated improved
performance/productivity (speed and accuracy), reduced operator fatigue and improved
comprehension of the audio data.

Because of the success of these techniques, modern workstations containing many of
the capabilities are commercially available. Audio manipulation capabilities are also
available in a stand alone unit (Figures 6 and 7).

5. DOD/RADC GISTING PROGRAM

RADC in conjunction with DOD/DARPA has begun a three year research and development
program to automatically, in real-time, "gist" voice traffic for the updating of
databases to produce in-time reports. If the program is successful it should
significantly increase the ability to collect and process large amounts of voice
traffic and reduce the data to its most meaningful kernel, i.e. "gist".

However, to develop a gisting technology requires advanced capabilities in the
following areas:

(a) Continuous Speech Recognition
(b) Keyword Recognition
(c) Speaker Identification
(d) Speaker Adaptation/Normalization
(e) Natural Language Processing
(f) Speech Understanding/Artificial Intelligence
(g) Noise Reduction Techniques

This technology is being applied to air traffic control voice communications.
Presently DOD/DARPA and RADC are collecting both a training and test data base using
digitally recorded live air traffic control voice communications.

The goal of the program is to extract information from the communication that
takes place between the aircraft and the control tower. The system will be capable of
producing a gist of the dialog and will compile the information about the transactions
and activities that occurred. Some of the capabilities are:

(a) Separate the speech between pilots and controllers
(b) Determine the airline and flight number
(c) Identify both the pilot and controller
(d) Determine the activity underway such as takeoff, landing, etc.
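
As a rough illustration of capabilities (b) and (d), the sketch below extracts a call sign and an activity from text that is assumed to have already been produced by a speech recognizer. The keyword table, the call sign pattern and the example airline name are hypothetical; the program's actual methods are not described here.

```python
import re

# Hypothetical keyword table mapping spotted words to an activity label.
ACTIVITY_KEYWORDS = {
    "takeoff": "takeoff",
    "land": "landing",
    "taxi": "taxi",
}

# Assumed call sign form: a capitalized airline name followed by a flight number.
CALLSIGN = re.compile(r"\b([A-Z][a-z]+) (\d{1,4})\b")


def gist(transcript):
    """Reduce one transcribed transmission to airline, flight and activity."""
    match = CALLSIGN.search(transcript)
    airline, flight = (match.group(1), match.group(2)) if match else (None, None)
    lowered = transcript.lower()
    activity = next(
        (label for keyword, label in ACTIVITY_KEYWORDS.items() if keyword in lowered),
        None,
    )
    return {"airline": airline, "flight": flight, "activity": activity}
```

For the made-up transmission "Redbird 451 cleared for takeoff runway 27", `gist` would report airline "Redbird", flight "451" and activity "takeoff". A fielded system would have to work from noisy recognizer output rather than clean text.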

A final goal of the program is to develop a real-time testbed system to perform
the extensive testing necessary to assess the current technology as well as provide
future direction for research and development to address military field operations.

6. FUTURE DIRECTION

RADC will continue to support the development of speech processing technologies
for critical Air Force C3I applications. A sound technology base is established
through contractual and in-house research work. Although several technologies look
promising for an advanced speech processing workstation, they cannot support Air Force
operations in many applications. However, the use of these technologies in
combination offers a potential solution. RADC is currently pursuing a combined and
interactive approach using ASR technologies, and is acting as the system integrator
for the advanced speech processing workstation. In order to provide these speech
processing capabilities to the field for test, evaluation and operation, an increase
in processing power is required. Additionally, the size, weight, power and cost must
be reduced. Therefore, RADC is involved in the development of a VHSIC speech
processor that can provide the processing power to support multiple speech functions
and channels.